Abstract: Many evaluations of Large Language Models (LLMs) focus on exam-style benchmarks that measure domain-specific knowledge acquisition or linguistic attributes like grammaticality. Such evaluations emphasize the functional capacities of LLMs while overlooking their ability to resonate with readers on a psychologically deep level. Addressing this gap, this work introduces the Psychological Depth Scale (PDS), a novel framework designed to measure a story's authenticity, empathy, engagement, narrative complexity, and emotional provocation. Through an empirical study involving 100 short stories written by humans and various LLMs, including GPT-4, we explore the consistency of human judgments of psychological depth, compare the depth of human- and LLM-authored stories, and examine the potential for automated assessment of psychological depth. Our findings reveal that (1) humans can consistently judge psychological depth despite its abstract nature; (2) despite being perceived as less "human", GPT-4 stories surpassed those of advanced human authors on 4 out of 5 dimensions of psychological depth, often by sizable margins; and (3) GPT-4 combined with a novel Mixture of Personas prompting strategy achieves a moderate correlation (0.44) with human judgments of psychological depth. These findings open the possibility that LLMs could be strategically deployed to forge deeper emotional and psychological bonds with humans in fields as diverse as therapy and popular entertainment.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English