I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs

Published: 23 Sept 2025, Last Modified: 17 Feb 2026CogInterp @ NeurIPS 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: contextual inference, subliminal learning, LLM personas, LLM psychology, self-consistency, safety, collusion
TL;DR: We show that LLMs can make inference about what personas they should assume based on its past binary answers to semantically unrelated question.
Abstract: Large language models (LLMs) can achieve high performance in next token prediction (NTP) by performing contextual inference: inferring information about the generative process underlying text, and integrating it into predictions. When engaging in conversation by autoregressively sampling the most likely tokens of a simulated assistant's response, this process constitutes the assistant's persona. Post-training methods such as reinforcement learning from human feedback aim to constrain the persona of this simulacrum to be helpful and harmless. Yet this persona is also influenced by a drive for self-consistency; LLMs will act on personas consistent with behaviour displayed in their context. We demonstrate that LLMs can infer information about past personas from a set of nonsensical but innocuous questions and binary answers in context, and act upon them in safety-related questions. This is despite the questions bearing no semantic relationship to the target misalignment behaviours, and each answer providing only one bit of information. By matching these questions and only differing binary answers across transmitted personas, we isolate the effects of contextual persona inference and self-consistency from subliminal learning from token entanglement during training.
Submission Number: 71
Loading