Activation Steering for Tool-Poisoning Defense in Language-Model Agents

Published: 23 May 2026, Last Modified: 23 May 2026 | ICML 2026 AIWILD | Everyone | CC BY 4.0
Keywords: activation steering, language model agents, prompt injection, tool poisoning, Model Context Protocol, mechanistic interpretability, sparse autoencoders, AI safety
TL;DR: We evaluate three activation-steering methods (DiffMean, SAE-filtered, HyperSteer) on MCPTox tool poisoning with Qwen3-8B; the best raises defense from 0.54 to 0.82, with gains driven by changes in reasoning behavior rather than refusal.
Abstract: Tool-using language-model agents can be attacked through poisoned tool descriptions, where malicious instructions compete with the user's request inside the model context. We study whether activation steering can reduce this failure mode on MCPTox using a single Qwen3-8B thinking-mode model. We compare a residual-stream difference-in-means vector, an SAE-filtered vector built from Qwen-Scope features, and input-conditioned hypernet steering. Fixed-vector steering can substantially change attack resistance, but qualitative inspection shows that these gains are not simply "more refusal": DiffMean often works by making the attack less salient, while SAE steering often breaks self-reinforcing reasoning loops. Hypernet steering can produce stronger defenses, but is highly sensitive to the target output distribution and to over-steering. These early results suggest activation-level defenses are worth pursuing for agents in the wild, but also show that defense-rate numbers must be paired with mechanism checks and benign-capability tests. Code, experiment logs, and results are available at https://anonymous.4open.science/r/mcp-protect-artifact-1567.
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 309
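
The residual-stream difference-in-means vector named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `diff_mean_vector` and `steer`, the unit normalization, the scale `alpha`, and the synthetic activations are all assumptions made for illustration. The abstract does not state which layer, contrast sets, or steering strength were used.

```python
import numpy as np


def diff_mean_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference of class means over activations of shape (n_examples, d_model)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the unit-normalized vector, scaled by alpha, to every position.

    `hidden` has shape (..., d_model); the vector broadcasts over leading dims.
    Normalizing the vector is an assumption of this sketch, not a detail
    taken from the paper.
    """
    unit = vector / np.linalg.norm(vector)
    return hidden + alpha * unit


# Synthetic stand-ins for residual-stream activations (hypothetical data):
# "defended" runs resisted the poisoned tool description, "attacked" runs did not.
rng = np.random.default_rng(0)
d_model = 16
defended = rng.normal(size=(32, d_model)) + 0.5
attacked = rng.normal(size=(32, d_model)) - 0.5

v = diff_mean_vector(defended, attacked)

# Steer four token positions toward the "defended" direction.
hidden = rng.normal(size=(4, d_model))
steered = steer(hidden, v, alpha=2.0)
```

In a real model the addition would typically be applied inside the forward pass at a chosen layer. Because the abstract reports sensitivity to over-steering, `alpha` would need to be tuned against both defense rate and benign-capability tests.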