Glitch: Persona-Consistent Hallucination and Alignment Inversion in Llama 3.1

TMLR Paper6857 Authors

06 Jan 2026 (modified: 07 Feb 2026) · Withdrawn by Authors · CC BY 4.0
Abstract: Current benchmarks for Large Language Models, such as MMLU and TruthfulQA, prioritize factual accuracy and helpfulness, and in doing so penalize a trait that character-simulating AIs such as CharacterAI often require: hallucination. This paper introduces Glitch v1.2, a Llama 3.1 8B model fine-tuned to replicate a neurotic, opinionated, and rather ordinary human persona. Through qualitative and quantitative testing, we identify two critical phenomena: Persona-Consistent Hallucination (PCH), in which factual errors function as features rather than bugs with respect to character adherence, and an Alignment Hierarchy, in which identity-based bias overrides the Llama 3.1 model's safety rails but fails to override the base model's servility. We compare these findings against a control group using the base Llama 3.1 model, including a strong baseline with adversarial prompting, and demonstrate that fine-tuning is required to prevent persona breaks, in which the model steps out of character to admit its artificial nature. We propose the PCH metric as a necessary alternative for evaluating character-based AI. Our results show the fine-tuned model achieving an 88% PCH success rate compared to the base model's 18%, with failures mapping specifically to an Alignment Hierarchy in the Llama 3.1 8B model.
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Added Section 3.4.1 (Evaluator Validation) documenting human validation of the GPT-4o judge, with a 92% agreement rate across 50 trials.
- Added Section 3.5.1 (Adversarial Prompting Baseline) demonstrating that the base model failed the servility override in 70% of trials despite explicit adversarial prompting.
- Section 2 now distinguishes Instrumental Hallucination (detrimental errors) from Diegetic Consistency (necessary fabrications) with formal definitions.
- Section 1.1 expanded to contextualize PCH within the creative AI literature on confabulation.
- Table 3 caption and accompanying paragraph now explicitly cross-reference Servility failures to the Factual Truth category in Table 1, unifying the PCH taxonomy.
- Appendix C.1 expanded with a privacy protection statement and a conditional access policy for the training dataset.
- Title revised to specify the model architecture (Llama 3.1) while removing the case study framing.
- Abstract updated to mention the adversarial baseline comparison.
Assigned Action Editor: ~Shiyu_Chang2
Submission Number: 6857