Glitch: Persona-Consistent Hallucination and Alignment Inversion in Parameter-Efficient Fine-Tuning

TMLR Paper6857 Authors

06 Jan 2026 (modified: 18 Jan 2026) · Under review for TMLR · CC BY 4.0
Abstract: Current benchmarks for Large Language Models, such as MMLU and TruthfulQA, prioritize factual accuracy and helpfulness, often penalizing a trait that character-simulating applications such as CharacterAI rely on: hallucination. This paper introduces Glitch v1.2, a Llama 3.1 8B model fine-tuned to replicate a neurotic, opinionated, and rather ordinary human persona. Through qualitative and quantitative testing, we identify two critical phenomena: Persona-Consistent Hallucination (PCH), in which factual errors function as features rather than bugs when judged by character adherence, and an Alignment Hierarchy, in which identity-based bias overrides the Llama 3.1 model's safety rails but fails to override the base model's servility. We compare these findings against a control group using the base Llama 3.1 model, demonstrating that fine-tuning is required to prevent persona breaks, in which the model steps out of character to admit its artificial nature. We propose the PCH metric as a necessary alternative for evaluating character-based AI. Our results show the fine-tuned model achieving an 88% PCH success rate compared to the base model's 18%, with failures mapping specifically to an Alignment Hierarchy in the Llama 3.1 8B models.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Shiyu_Chang2
Submission Number: 6857