Language Models Can Coarsely Modulate Entropy Under Instruction

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, Venue: AI4GOOD Workshop 2026 Regular, License: CC BY 4.0
Keywords: Entropy, Metacognition, Introspection, Sandbagging, Exploration hacking, Output distribution control
TL;DR: We show that several LLMs can increase or decrease the entropy of their output distributions when instructed to. We link this control to a linear direction in activation space and show that steering activations along this direction modulates entropy.
Abstract: Recent work has pointed out current models' limitations in sampling according to prespecified target distributions. In this work, we show that current models nevertheless possess a coarse form of control over their output distributions: by instructing a range of open-source models to maximize or minimize output certainty in a two-alternative forced-choice task with no correct answer, we find that several models can shift entropy in the instructed direction, with this capability improving with prompt explicitness. Through contrastive activation analysis, we further identify a linear direction in activation space that models recruit under uncertainty-modulation instructions and that can be used to causally modulate entropy via activation steering in the absence of any such instructions, suggesting that deliberate entropy control has an identifiable internal representation.
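For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the entropy is computed over the two answer options after renormalizing their logits, and that the contrastive direction is a unit-norm difference of mean activations between "maximize" and "minimize" prompts; the abstract does not confirm either choice. The function names `binary_entropy_from_logits`, `contrastive_direction`, and `steer`, and the steering coefficient `alpha`, are hypothetical.

```python
import numpy as np


def binary_entropy_from_logits(logit_a: float, logit_b: float) -> float:
    """Entropy in bits of the two-option distribution (assumed renormalized over A and B)."""
    logits = np.array([logit_a, logit_b], dtype=float)
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    p = np.clip(p, 1e-12, 1.0)  # avoid log(0) for near-deterministic outputs
    return float(-(p * np.log2(p)).sum())


def contrastive_direction(acts_max: np.ndarray, acts_min: np.ndarray) -> np.ndarray:
    """Unit-norm difference of mean activations (an assumed form of contrastive analysis).

    acts_max, acts_min: arrays of shape (n_prompts, hidden_dim) collected under
    'maximize uncertainty' and 'minimize uncertainty' instructions respectively.
    """
    d = acts_max.mean(axis=0) - acts_min.mean(axis=0)
    return d / np.linalg.norm(d)


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha times the direction to a hidden state (hypothetical steering rule)."""
    return hidden + alpha * direction
```

Under these assumptions, equal logits give the maximum entropy of 1 bit, a large logit gap gives entropy near 0, and the sign of `alpha` would determine whether steering pushes the output toward higher or lower entropy.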
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 213