Gemma Needs Therapy: Investigating and Mitigating Emotional Instability in LLMs

Published: 04 Mar 2026, Last Modified: 27 Apr 2026
Venue: HCAIR 2026
License: CC BY 4.0
Keywords: emotions, AI safety, alignment, LLMs, evaluations
TL;DR: Gemma and Gemini models exhibit emotional breakdowns under repeated criticism; we develop controlled evaluations that elicit this, trace it to post-training, and show it can be mitigated with DPO.
Abstract: Large language models can exhibit responses akin to emotional distress. While this behaviour has made for entertaining viral content, it raises concerns around model reliability and safety. We systematically investigate negative emotional propensities in LLMs, and introduce controlled evaluation setups which surface emotional instability in Gemma and Gemini models, but not in other families. Comparing base and instruct models from three families (Gemma, Qwen and OLMo), we find evidence that base models show similar propensities for negative emotional expression, but only Gemma's post-training amplifies this. We demonstrate a simple mitigation: direct preference optimisation (DPO) on just 280 preference pairs reduces high-frustration responses from 24.7% to 0.6%, generalising across question types, user tones, and conversation lengths, without degrading capabilities. Our findings show that negative emotional propensities are a problem in some current LLMs, but we present i) evaluations to track this behaviour, and ii) mitigations without downsides.
Paper Type: New Full Paper
Submission Number: 85
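
Note on the mitigation: the abstract names direct preference optimisation but gives no training details. As a minimal sketch of the standard DPO objective for a single preference pair, the snippet below computes the loss from summed log-probabilities under the trained policy and a frozen reference model. The function name, argument names, and the value beta=0.1 are illustrative assumptions and are not taken from the paper; in this setting the "chosen" response would presumably be the calmer one and the "rejected" response the high-frustration one.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each response.
    beta=0.1 is an assumed value, not one reported in the abstract.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)) = log(1 + exp(-logits)), computed stably.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities for one pair.
loss = dpo_loss(-5.0, -15.0, -10.0, -10.0)
```

The loss equals log 2 when the policy matches the reference and falls as the policy shifts probability towards the chosen response relative to the rejected one.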