Safe Context Switching for Agents in the Wild: Mitigating Subspace Interference via Orthogonal Adaptation
Track: long paper (up to 10 pages)
Keywords: Logical Reasoning, Chain-of-Thought, Reasoning-Safety Trade-off, Sequential Subspace Interference, Spectral Independence, Null Space Projection
TL;DR: We quantify a 23.3% alignment penalty incurred by fine-tuning for Chain-of-Thought reasoning and propose AURA, which decouples the sequential objectives via null-space orthogonal projection, restoring >0.98 cosine fidelity to the aligned state.
Abstract: Most Large Language Models exhibit a fundamental tension between two sequential tasks: Logical Reasoning and Safety Alignment. The high-variance internal states required for sophisticated Chain-of-Thought (CoT) deduction geometrically collide with the latent manifolds encoding safety constraints. We characterize this phenomenon as Sequential Subspace Interference, showing that standard fine-tuning on logical tasks (e.g., multi-step math or code generation) incurs a 23.3% Interference Penalty on alignment benchmarks, effectively decoupling the model from its safety priors. This “Reasoning Drift” is not adequately addressed by current adaptation methods, since in practice the gradients of logical tasks are almost never orthogonal to safety objectives. To address this, we propose AURA (Adaptive Unique Residual Allocation), a spectral regularization framework that enforces Spectral Independence between reasoning and safety. By explicitly computing the Null Space of the alignment manifold and restricting reasoning updates to that orthogonal complement, AURA enables models to maximize logical expressivity without jeopardizing safety guarantees. Empirically, AURA recovers +23.0% of the lost alignment performance and preserves >0.98 cosine fidelity to the safe state, demonstrating that the trade-off between sound reasoning and careful alignment can be resolved through geometric decoupling.
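For intuition, the null-space projection at the heart of this idea can be sketched in a few lines of NumPy. The sketch below is illustrative only, not the paper's implementation: it assumes the alignment manifold can be summarized by a small matrix of safety-relevant directions (`safety_dirs`, a hypothetical stand-in), builds the projector onto the subspace orthogonal to those directions, and applies it to a reasoning-task gradient so the update cannot move the model along the safety subspace.

```python
import numpy as np

def null_space_projector(safety_dirs: np.ndarray, rank_tol: float = 1e-6) -> np.ndarray:
    """Projector onto the orthogonal complement (null space) of the subspace
    spanned by the rows of `safety_dirs` (directions encoding the alignment manifold)."""
    # SVD yields an orthonormal basis for the row space of safety_dirs.
    _, s, vt = np.linalg.svd(safety_dirs, full_matrices=False)
    rank = int((s > rank_tol * s.max()).sum())
    basis = vt[:rank]                      # orthonormal rows spanning the safety subspace
    dim = safety_dirs.shape[1]
    # P = I - B^T B removes any component lying in the safety subspace.
    return np.eye(dim) - basis.T @ basis

# Toy usage: constrain a "reasoning" gradient so it cannot perturb safety directions.
rng = np.random.default_rng(0)
safety_dirs = rng.normal(size=(8, 64))     # 8 hypothetical safety-gradient directions
grad = rng.normal(size=64)                 # raw reasoning-task update
P = null_space_projector(safety_dirs)
grad_safe = P @ grad                       # interference-free component of the update
assert np.allclose(safety_dirs @ grad_safe, 0.0, atol=1e-8)
```

In this toy setting, applying `grad_safe` instead of `grad` leaves every safety direction untouched by construction, which is the geometric decoupling the abstract describes.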
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 23