DECODING LOGICAL NEGATION IN LARGE LANGUAGE MODELS: FROM STATISTICAL HEURISTICS TO CAUSAL SEMANTIC CIRCUITS

Published: 01 Apr 2026, Last Modified: 25 Apr 2026 · ICLR 2026 Workshop LLM Reasoning · CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: Mechanistic Interpretability
TL;DR: We use sparse autoencoders and causal interventions to show that Gemma-2-27B implements negation through Layer 10 features that generalize across negation constructions.
Abstract: We investigate the internal computational mechanisms that activate when large language models process foundational logical primitives, focusing specifically on logical negation. Using sparse autoencoders (SAEs), we decompose high-dimensional residual-stream activations into interpretable, localized features. We present a two-stage investigation to isolate true logical abstraction from statistical pattern matching. In our exploratory phase, we demonstrate that smaller autoregressive models (e.g., GPT-2 Small) fail to encode formal logical abstractions, achieving near-random accuracy on synthetic logical extraction tasks and relying instead on shallow bag-of-words heuristics. Consequently, our primary phase shifts to Gemma-2-27B, using a highly controlled "nonce" (pseudoword) dataset to strictly isolate boolean reasoning from real-world semantic priors. We identify a sparse set of features at Layer 10 that serve as the causal locus of negation. We causally invert the model's logical state and demonstrate that these features act as generalized semantic operators, robustly activating across diverse negators ("no", "never", "fail", "un-"), rather than as mere lexical detectors. Finally, circuit tracing reveals a feed-forward pathway in which ablating early-layer features collapses downstream representations by $\sim$40\%.
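To make the abstract's two core operations concrete, the sketch below shows (1) decomposing a residual-stream activation with an SAE and (2) causally ablating a candidate negation feature before patching the edited reconstruction back into the stream. This is a minimal illustrative sketch, not the authors' released code: the dimensions, the feature index `NEG_FEATURE`, and all variable names are assumptions.

```python
# Hypothetical sketch: SAE decomposition of a residual-stream activation
# and causal ablation of a candidate "negation" feature.
import torch
import torch.nn as nn

D_MODEL, D_SAE = 4608, 16384   # assumed residual width / dictionary size
NEG_FEATURE = 1234             # hypothetical negation-feature index

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU yields non-negative, sparse feature activations
        return torch.relu(self.enc(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.dec(f)

sae = SparseAutoencoder(D_MODEL, D_SAE)
resid = torch.randn(1, D_MODEL)        # stand-in for a Layer-10 residual vector

feats = sae.encode(resid)              # sparse, interpretable features
ablated = feats.clone()
ablated[:, NEG_FEATURE] = 0.0          # causal intervention: zero the feature

# Keep the SAE's reconstruction error term so that only the ablated
# feature changes, not the overall reconstruction quality.
error = resid - sae.decode(feats)
patched_resid = sae.decode(ablated) + error
```

In an actual experiment, `patched_resid` would be hooked back into the model's forward pass at the intervention layer, and the downstream change (e.g., the flip in the model's predicted truth value, or the $\sim$40\% collapse in downstream feature activation) would be measured against the unpatched run.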
Presenter: ~Umair_Tariq1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 208