Class-Conditional Activation Regularization (CCAR): Intrinsic Robustness as an Emergent Geometric Property

Published: 11 Jun 2026, Last Modified: 18 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Feature Geometry, Methods (probing, steering, causal interventions), Interpretability for AI Safety
Other Keywords: Geometric Disentanglement, Latent Topology, Subspace Partitioning, Block-Diagonal Covariance, Neural Collapse, Causal Interventions, Polysemanticity, Activation Steering, Fisher Discriminant Ratio, Intrinsic Robustness
TL;DR: We introduce a geometric regularizer that confines class representations to orthogonal subspaces, demonstrating that intrinsic robustness naturally emerges from a block-diagonal latent topology.
Abstract: Standard supervised learning optimizes for predictive accuracy but remains agnostic to the internal geometry of learned features, often yielding representations that are entangled and brittle. We propose Class-Conditional Activation Regularization (CCAR) to explicitly engineer the feature space, imposing a block-diagonal structure via a soft inductive bias. By shaping the latent representation to confine class energy to orthogonal subspaces, we create an intrinsic geometric scaffold that naturally filters noise and adversarial perturbations. We provide theoretical analysis linking this structural constraint to the maximization of the Fisher Discriminant Ratio, establishing a formal connection between geometric disentanglement and algorithmic stability. Empirically, this approach demonstrates that robustness is an emergent property of a well-engineered feature space, significantly outperforming baselines on label noise and input corruption benchmarks.
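The abstract does not state the CCAR loss itself, so the following is only a minimal sketch of one way a soft block-diagonal bias could be expressed, under assumptions that are not from the paper: the latent dimension is split into equal contiguous blocks, one per class, and the penalty is the mean squared activation energy that falls outside a sample's assigned block. The names `block_mask` and `off_block_penalty` are hypothetical.

```python
import numpy as np

def block_mask(labels, num_classes, dim):
    """Return an (n, dim) 0/1 mask that is 1 on each sample's assigned class block.

    Assumption (not from the paper): class c owns the contiguous slice
    [c * dim / num_classes, (c + 1) * dim / num_classes).
    """
    if dim % num_classes != 0:
        raise ValueError("dim must be divisible by num_classes in this sketch")
    block = dim // num_classes
    mask = np.zeros((len(labels), dim))
    for i, c in enumerate(labels):
        mask[i, c * block:(c + 1) * block] = 1.0
    return mask

def off_block_penalty(features, labels, num_classes):
    """Mean squared activation energy outside each sample's class block.

    Hypothetical regularizer: it is zero exactly when every sample's
    activations lie inside its own class subspace, which is the
    block-diagonal structure the abstract describes.
    """
    features = np.asarray(features, dtype=float)
    mask = block_mask(labels, num_classes, features.shape[1])
    off_block = features * (1.0 - mask)  # keep only out-of-block activations
    return float(np.mean(np.sum(off_block ** 2, axis=1)))

# Two samples whose energy sits entirely in their own blocks: no penalty.
clean = off_block_penalty([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]], [0, 1], 2)
# One class-0 sample leaking energy into the class-1 block.
leaky = off_block_penalty([[1.0, 0.0, 3.0, 0.0]], [0], 2)
```

In a training loop, a term of this kind would presumably be added to the task loss with a weighting coefficient, which is what makes the constraint a soft bias rather than a hard projection.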
Submission Number: 93