Concept Activation Regions for Multi-Concept Activation and (Dis)Entanglement in Large Language Models

TMLR Paper7143 Authors

24 Jan 2026 (modified: 09 Jun 2026), Rejected by TMLR, Everyone, CC BY 4.0

Abstract: This work extends the Bias-CAV framework with a geometric perspective on multi-concept activations and their entanglement in large language models. Rather than treating concepts as single directions, the framework reframes them as probe-dependent activation regions (level sets of learned classifiers) and introduces Multi-Concept Activation Subspaces (MCAS) to jointly model multiple bias-related concepts. A central distinction is drawn between \emph{directional entanglement} (alignment of concept directions, reducible by orthogonalization) and \emph{measure entanglement} (overlap of activation distributions, which may persist due to data correlations). Empirically, the two metrics are only weakly correlated ($r = 0.33$; $\rho_{\text{dir}}$ explains only 11\% of the variance in $\rho_{\text{mass}}$), confirming that they capture substantially different information about concept relationships. Conditional disentanglement methods are developed to operationalize partial concept separation via orthogonal projection, achieving AUC-based cross-concept sensitivity reductions of 2--15\%. MCAS-based interventions constrained to learned subspaces achieve bias reduction comparable to single-direction baselines while reducing cross-concept spillover by 5--10$\times$, as measured both by activation-space geometry (Experiment~4) and by direct output-level evaluation via hooked forward passes (Experiment~7). Layer-wise entanglement patterns reveal architecture-dependent trajectories: encoder models accumulate entanglement, moving from orthogonal embeddings ($\rho_{\text{dir}}^{(1)} = 0.00$) to substantial alignment ($\rho_{\text{dir}}^{(L)} = 0.35$--$0.70$), while decoder models begin with pre-existing entanglement. Three-concept intersectional analysis confirms that $\sim$40\% measure entanglement persists after complete directional orthogonalization at the median threshold (ranging from $\sim$74\% at permissive thresholds to $\sim$0\% at restrictive thresholds), consistent with fundamental data correlations. For practitioners, the framework provides methods for analyzing intersectional bias patterns and for improving attribution clarity through conditional disentanglement, even when full concept separation is not achievable.
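
The distinction between directional and measure entanglement can be illustrated with a minimal sketch. The abstract does not give the exact definitions of $\rho_{\text{dir}}$ and $\rho_{\text{mass}}$, so the code below makes its own assumptions: directional entanglement is taken to be the absolute cosine similarity of two linear probe directions, an activation region is taken to be the set of activations whose probe score exceeds a threshold, and measure entanglement is taken to be the Jaccard overlap of two such regions. All function names are hypothetical and the activations are synthetic, not data from the paper.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def directional_entanglement(w_a, w_b):
    """Assumed proxy for rho_dir: absolute cosine similarity of two directions."""
    return abs(float(unit(w_a) @ unit(w_b)))

def orthogonalize(w_a, w_b):
    """Remove from w_a its component along w_b (one Gram-Schmidt step)."""
    u_b = unit(w_b)
    return w_a - (w_a @ u_b) * u_b

def region(acts, w, threshold):
    """Boolean mask of activations whose linear probe score exceeds threshold."""
    return acts @ unit(w) > threshold

def measure_entanglement(acts, w_a, w_b, t_a, t_b):
    """Assumed proxy for rho_mass: Jaccard overlap of two activation regions."""
    in_a, in_b = region(acts, w_a, t_a), region(acts, w_b, t_b)
    union = np.logical_or(in_a, in_b).sum()
    return float(np.logical_and(in_a, in_b).sum() / union) if union else 0.0

# Synthetic activations: two latent concept strengths that co-occur in the
# data (correlation 0.8), written along two orthogonal axes of an 8-dim space.
rng = np.random.default_rng(0)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
basis = np.eye(8)[:2]
acts = z @ basis + 0.05 * rng.standard_normal((5000, 8))

# Probe direction for concept A leaks into concept B's axis.
w_a = basis[0] + 0.5 * basis[1]
w_b = basis[1]

rho_dir_before = directional_entanglement(w_a, w_b)

# Orthogonalization removes the directional alignment entirely.
w_a_perp = orthogonalize(w_a, w_b)
rho_dir_after = directional_entanglement(w_a_perp, w_b)

# Median thresholds on each probe score define the two regions.
t_a = np.median(acts @ unit(w_a_perp))
t_b = np.median(acts @ unit(w_b))
rho_mass_after = measure_entanglement(acts, w_a_perp, w_b, t_a, t_b)

print(f"rho_dir before: {rho_dir_before:.3f}, after: {rho_dir_after:.3f}")
print(f"region overlap after orthogonalization: {rho_mass_after:.3f}")
```

In this toy setting the orthogonalized directions have zero cosine similarity, yet the two regions still overlap substantially because the underlying concept strengths are correlated in the data. This mirrors, under the stated assumptions, the abstract's claim that measure entanglement can persist after directional orthogonalization.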

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Chao_Chen1

Submission Number: 7143
