Towards stable belief LLMs

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | Everyone | CC BY 4.0
Keywords: LLM beliefs, Non-Factual Beliefs, Model Editing, Activation Steering, Belief Injection, AI Safety, Human Simulation
TL;DR: We study methods for injecting non-factual beliefs in LLMs and propose a framework to evaluate the stability and depth of these beliefs.
Abstract: Large language models (LLMs) are increasingly used as proxies for human respondents in social science research, survey simulation, and behavioral simulation. A key desideratum for such use is that models exhibit stable and coherent beliefs, rather than merely producing plausible-sounding outputs that shift under pressure or with changes in phrasing. While recent work has studied how to implant and measure factual beliefs in LLMs, the analogous problem for non-factual beliefs remains underexplored. Understanding how to represent and maintain stable non-factual beliefs is an important step toward models whose responses remain consistent across contexts, which is crucial if LLMs are to be used reliably to simulate human attitudes and behaviors in downstream applications. This ongoing work takes a first step toward addressing this gap.
Submission Number: 148