A Late-Layer MLP Arbitrates Answer-or-Defer Decisions in Autoregressive Transformers

18 Nov 2025 (modified: 04 Jan 2026) · AAAI 2026 Workshop NeusymBridge · Withdrawn Submission · CC BY 4.0
Keywords: tool call mechanism
TL;DR: We locate and causally validate a compact late-layer MLP that arbitrates the answer-or-defer choice at the decision token, with effects scaling across Pythia and also observed in Llama-3-8B and Phi-3.
Abstract: Large language models must repeatedly decide whether to answer directly from internal knowledge or to defer toward tool-like behavior. We study this \emph{answer–versus–defer} arbitration in autoregressive transformers and test the hypothesis that a compact late-layer module influences the choice. Using position-aligned causal patching at the decision token, we measure a \textit{decision margin}, the logit difference between the canonical answer's first token and a small, fixed set of deferral-trigger tokens, to quantify this bias. Across the \textbf{Pythia} scaling suite (2.8B, 6.9B, 12B), we consistently observe a late-layer MLP subcomponent whose clean-minus-random corrected effect peaks near the top of the stack and shifts later with scale. Additive interventions at this layer reliably increase the decision margin toward answering, while placebo edits and alternate token sets yield near-zero change. Qualitatively similar late-layer localization and MLP-over-head dominance are observed in \textbf{LLaMA-3-8B} and \textbf{Phi-3-mini}, suggesting that this arbitration motif generalizes across model families. Our findings are consistent with the view that late MLPs may encode compact, confidence-like signals that influence immediate behavioral choices, offering a reproducible mechanistic handle for analyzing and steering answer–defer dynamics.
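The decision margin described above can be sketched in a few lines. The abstract does not specify how the deferral-token set is reduced to a single reference logit, so the sketch below assumes a max over the set (taking the mean instead would be an equally plausible reading); the function name and toy token IDs are illustrative, not from the paper.

```python
import numpy as np

def decision_margin(logits, answer_token_id, deferral_token_ids):
    """Decision margin at the decision token.

    Logit of the canonical answer's first token minus the strongest
    logit among a small, fixed set of deferral-trigger tokens.
    Positive values indicate a bias toward answering directly.
    """
    answer_logit = logits[answer_token_id]
    deferral_logit = logits[list(deferral_token_ids)].max()  # assumed reduction
    return float(answer_logit - deferral_logit)

# Toy example over a 6-token vocabulary: answer token id 3,
# deferral-trigger token ids {1, 2}.
logits = np.array([2.0, -1.0, 0.5, 3.5, 0.0, 1.0])
margin = decision_margin(logits, answer_token_id=3, deferral_token_ids=[1, 2])
# 3.5 - max(-1.0, 0.5) = 3.0, i.e. the model favors answering here
```

In a patching experiment this quantity would be computed at the decision-token position before and after intervening on the candidate late-layer MLP, with the change in margin serving as the causal effect size.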
Submission Number: 6