Reinforcement Learning in Language Models Recruits a Shared Functional Welfare Axis

Andy Q Han; David J. Chalmers; Pavel Izmailov

Published: 11 Jun 2026, Last Modified: 11 Jun 2026; Mech Interp Workshop ICML 2026 Virtual poster; CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Feature Geometry, Interpretability for AI Safety

Other Keywords: reinforcement learning, ai welfare

TL;DR: By comparing concept vectors before and after training, we show that RL, even on an affectively neutral reward signal, recruits a representation of *functional welfare*: an estimate of how well or badly the system is doing, relative to its goals.

Abstract: How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a representation of *functional welfare* that already exists in the base model: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment, extract concept vectors for rewarded and punished trajectories, and evaluate those vectors on tasks unrelated to the maze. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust across model families and environmental controls, and largely persist when we replace RL with supervised fine-tuning. Importantly, these effects appear in the models before any maze training. We therefore argue that this functional welfare axis pre-exists in the model rather than being created by reinforcement learning. While we make no claims about any experience of welfare, the axis offers a demonstration of how minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.

Submission Number: 223
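
The abstract refers to extracting concept vectors for rewarded and punished trajectories, checking that the two are nearly antiparallel, and steering with them. It does not say how the vectors are extracted, so the following is only a minimal sketch under one common assumption: a difference of mean activations between concept examples and a neutral baseline. The function names (`concept_vector`, `cosine`, `steer`), the dimensionality, and the synthetic activations are all hypothetical and stand in for real hidden states.

```python
import numpy as np


def concept_vector(concept_acts, baseline_acts):
    """Unit-length difference of mean activations (an assumed extraction method)."""
    v = concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return v / np.linalg.norm(v)


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(hidden, vector, alpha):
    """Additive steering: shift a hidden state along a concept direction."""
    return hidden + alpha * vector


# Synthetic stand-ins for hidden states; NOT data from the paper.
rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size
axis = rng.normal(size=d)
axis /= np.linalg.norm(axis)

neutral = rng.normal(size=(500, d))
rewarded = rng.normal(size=(500, d)) + 5.0 * axis  # shifted along +axis
punished = rng.normal(size=(500, d)) - 5.0 * axis  # shifted along -axis

v_reward = concept_vector(rewarded, neutral)
v_punish = concept_vector(punished, neutral)

# Strongly negative when the two directions are nearly antiparallel.
similarity = cosine(v_reward, v_punish)
print(f"cosine(reward, punish) = {similarity:.3f}")

# Push a hypothetical hidden state toward the punishment direction.
steered = steer(np.zeros(d), v_punish, alpha=4.0)
```

In this toy setup the antiparallel structure is built in by construction, so the cosine only illustrates the measurement. It is not evidence for the claim in the abstract.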
