Keywords: Large Language Models, Bayesian Models, In-Context Learning, Interpretability
Abstract: Large language models (LLMs) can be controlled through prompts (in-context learning) and internal activations (activation steering), but a unified theory explaining these methods is lacking, so applying them often relies on trial and error. Here, we develop a unifying predictive account of LLM control from a Bayesian perspective, proposing that both context- and activation-based interventions impact behavior by shifting the model's belief in latent concepts. Under our framework, steering operates by shifting concept priors, while in-context learning accumulates evidence for a concept. This theory predicts three key phenomena, which we verify empirically: (i) sigmoidal learning curves as in-context evidence accumulates, (ii) predictable shifts of these curves under activation steering, and (iii) additive effects of the two interventions, creating distinct behavioral phases. Our framework yields a closed-form model that is highly predictive of LLM behavior across context- and activation-based interventions in five domains inspired by prior work on many-shot in-context learning. Crucially, this model also predicts the precise crossover boundaries where these interventions trigger sudden behavioral shifts. Taken together, our framework offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.
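To make the abstract's claims concrete, here is a minimal sketch of the kind of closed-form model it describes, under assumed functional forms that are not given in the abstract itself: steering adds a shift to the prior log-odds of a latent concept, each in-context example contributes a fixed increment of log-likelihood evidence, and the posterior belief is a sigmoid of their sum. All names and parameters (`evidence_per_example`, `prior_logodds`, `steer_shift`) are hypothetical illustrations, not the paper's actual notation.

```python
import numpy as np

def posterior_belief(n_examples, evidence_per_example, prior_logodds, steer_shift=0.0):
    """Assumed closed form: posterior belief in a latent concept.

    Steering shifts the prior log-odds by `steer_shift`; each in-context
    example adds `evidence_per_example` to the log-odds. The posterior is
    then a sigmoid in the number of examples, which would produce the
    sigmoidal learning curves, predictable steering shifts, and additive
    effects the abstract describes.
    """
    logodds = prior_logodds + steer_shift + n_examples * evidence_per_example
    return 1.0 / (1.0 + np.exp(-logodds))

def crossover_n(evidence_per_example, prior_logodds, steer_shift=0.0):
    """Number of examples at which the posterior crosses 0.5, i.e. where
    accumulated evidence exactly cancels the (steered) prior. This is the
    kind of crossover boundary between behavioral phases the abstract
    claims the model predicts."""
    return -(prior_logodds + steer_shift) / evidence_per_example
```

Under this additive-log-odds assumption, steering and in-context evidence are interchangeable currencies: a steering shift of +2 moves the crossover by the same amount as 2 / `evidence_per_example` fewer in-context examples.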
Primary Area: interpretability and explainable AI
Submission Number: 23195