Keywords: Entropy, Metacognition, Introspection, Sandbagging, Exploration hacking, Output distribution control
TL;DR: We show that several LLMs can increase or decrease the entropy of their output distributions when instructed to. We link this control to a linear direction in activation space and show that steering activations along this direction modulates entropy.
Abstract: Recent work has pointed out current models' limitations in sampling according to prespecified target distributions. In this work, we show that current models nevertheless possess a coarse form of control over their output distributions: by instructing a range of open-source models to maximize or minimize output certainty in a two-alternative forced-choice task with no correct answer, we find that several models can shift entropy in the instructed direction, with this capability improving with prompt explicitness. Through contrastive activation analysis, we further identify a linear direction in activation space that models recruit under uncertainty-modulation instructions and that can be used to causally modulate entropy via activation steering in the absence of any such instructions, suggesting that deliberate entropy control has an identifiable internal representation.
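To make the quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes entropy is measured in bits over the two answer options after renormalizing their logits, that the contrastive direction is a unit-norm difference of mean activations between two instruction conditions, and that steering adds a scaled copy of that direction to a hidden state. All function names and the scale `alpha` are hypothetical.

```python
import math


def binary_entropy_from_logits(logit_a: float, logit_b: float) -> float:
    """Entropy in bits of a two-option distribution (assumed renormalization
    over the two answer-token logits only)."""
    m = max(logit_a, logit_b)  # subtract the max for numerical stability
    ea = math.exp(logit_a - m)
    eb = math.exp(logit_b - m)
    pa = ea / (ea + eb)
    pb = 1.0 - pa
    # Skip zero-probability terms, whose contribution is 0 by convention.
    return -sum(p * math.log2(p) for p in (pa, pb) if p > 0.0)


def mean_difference_direction(acts_pos, acts_neg):
    """Unit-norm difference of mean activations between two prompt conditions
    (an assumed form of contrastive activation analysis)."""
    dim = len(acts_pos[0])
    mean_pos = [sum(v[i] for v in acts_pos) / len(acts_pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in acts_neg) / len(acts_neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("conditions have identical mean activations")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha: float):
    """Add alpha times the direction to a hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

Under these assumptions, equal logits give the maximum of 1 bit and a large logit gap gives entropy near 0, so an instructed or steered shift in entropy would show up as movement between those two extremes.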
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 213