Keywords: Machine Learning, LLMs, AI Safety, Activation Steering, Prompt Inversion
TL;DR: We formally show that, under practical assumptions, almost surely no prompt can elicit the same internal behavior achieved through activation steering, and we therefore argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
Abstract: Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior (such as adopting a persona). It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating internal activations into human-readable explanations; Pan et al., 2024) and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model’s natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts: almost surely, no prompt can reproduce the internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
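To make the surjectivity framing concrete, here is a minimal formalization; the notation ($f_\ell$, $\mathcal{V}$, $x$, $h$, $v$, $\alpha$) is our illustrative choice, not taken from the paper. Let $f_\ell : \mathcal{V}^{*} \to \mathbb{R}^{d}$ map a discrete prompt $x$ over vocabulary $\mathcal{V}$ to the residual-stream state at layer $\ell$, and let steering add a vector $v$ at strength $\alpha$:

$$h = f_\ell(x), \qquad h_{\text{steered}} = h + \alpha v.$$

The surjectivity question asks whether $h_{\text{steered}} \in \operatorname{Im}(f_\ell)$, i.e., whether some prompt $x' \in \mathcal{V}^{*}$ satisfies $f_\ell(x') = h_{\text{steered}}$. Since $\mathcal{V}^{*}$ is countable, $\operatorname{Im}(f_\ell)$ is a countable subset of $\mathbb{R}^{d}$, so a generically chosen offset $\alpha v$ misses it almost surely; this is only an intuition pump for the separation claimed above, not the paper's actual argument.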
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 70