Model Organisms for Generalization Resistance Under Distribution Shift

Published: 02 Mar 2026, Last Modified: 07 Mar 2026, Venue: ICLR 2026 Trustworthy AI, License: CC BY 4.0
Keywords: Generalization, Large Language Models, Model Organisms, AI Safety, Robustness
TL;DR: We illustrate a mechanism for generalization failure in language models in which training changes the weights assigned to a mixture of context-dependent policies, rather than the underlying properties of the policies.
Abstract: Consider post-training a language model on a curated task suite, with the expectation that improvements transfer to a broader deployment distribution. Such transfer may fail if the model learns cues that distinguish the training distribution from the deployment distribution and changes its behavior accordingly. We formalize a mechanism through which generalization may fail: if a language model approximates a mixture of conditional policies, then training may update which policy is expressed rather than the underlying policies themselves. We demonstrate this empirically by first using Supervised Fine-Tuning to approximate a mixture of policies, and then observing that Reinforcement Learning selects for the policies that obtain the highest reward on the training distribution. This can produce striking behaviors: in a controlled setting with two distributions containing identical questions presented with two different formatting tags, RL training on either one actively degrades performance on the other to zero, even though the underlying task is the same. We use this mechanism to illustrate two novel ways in which generalization may fail in future language models, corresponding to distribution shifts in task coverage and in temporal context, respectively. While our construction is deliberately simple and may not closely resemble 'natural' generalization failures, it can be used to demonstrate subtle ways in which post-training may fail and to highlight the importance of further research into making generalization more robust.
Submission Number: 127
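
The mechanism described in the abstract can be sketched with a toy example. This is an illustrative assumption, not the paper's experimental setup: the two fixed policies, the tags "A" and "B", the `P_CORRECT` table, and the functions `accuracy` and `train_on_tag` are all hypothetical names introduced here. The sketch holds each policy fixed and runs exact expected-reward gradient ascent on the mixture logits only, so that training on one tag shifts weight toward the policy that scores well on that tag.

```python
import math

# Hypothetical toy setup (not from the paper): two fixed, context-dependent
# policies. P_CORRECT[k][tag] is the probability that policy k answers
# correctly when the question carries that formatting tag.
P_CORRECT = [
    {"A": 1.0, "B": 0.0},  # policy 0: answers well only under tag A
    {"A": 0.0, "B": 1.0},  # policy 1: answers well only under tag B
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy(logits, tag):
    """Expected accuracy of the mixture on questions carrying `tag`."""
    weights = softmax(logits)
    return sum(w * p[tag] for w, p in zip(weights, P_CORRECT))


def train_on_tag(logits, tag, steps=200, lr=1.0):
    """Gradient ascent on expected reward for one tag.

    Only the mixture logits are updated; the policies themselves never
    change. For softmax weights w and per-policy reward r, the gradient of
    sum_j w_j * r_j with respect to logit k is w_k * (r_k - sum_j w_j * r_j).
    """
    logits = list(logits)
    for _ in range(steps):
        weights = softmax(logits)
        baseline = sum(w * p[tag] for w, p in zip(weights, P_CORRECT))
        for k, (w, p) in enumerate(zip(weights, P_CORRECT)):
            logits[k] += lr * w * (p[tag] - baseline)
    return logits


if __name__ == "__main__":
    start = [0.0, 0.0]  # equal weight on both policies
    trained = train_on_tag(start, "A")
    print("accuracy on A:", accuracy(trained, "A"))
    print("accuracy on B:", accuracy(trained, "B"))
```

In this toy, accuracy on the trained tag rises while accuracy on the other tag falls toward zero, because the two accuracies are tied to complementary mixture weights. Real language models need not decompose this cleanly; the sketch only shows how reweighting alone, with no change to the policies, can produce that pattern.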