The Narcissus Hypothesis: Descending to the Rung of Illusion

Published: 24 Sept 2025, Last Modified: 03 Oct 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: model collapse, machine behaviour, causality, AI alignment, epistemology
Abstract: Modern foundation models increasingly reflect not just world knowledge but also patterns of human preference embedded in their training data. We hypothesize that recursive alignment, via human feedback and model-generated corpora, induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We call this the _Narcissus Hypothesis_ and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl’s Ladder of Causality, culminating in what we call the _Rung of Illusion_.
Submission Number: 65