Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability
Keywords: Interpretability tooling and software, Circuit analysis
TL;DR: We show that all LayerNorm layers can be removed from GPT-2 models via fine-tuning with minimal performance loss
Abstract: Layer-wise normalization (LN) is an essential component of virtually all
transformer-based large language models. While its effects on training stability are
well documented, its role at inference time is poorly understood. Additionally, LN
layers hinder mechanistic interpretability by introducing additional nonlinearities
and increasing the interconnectedness of individual model components. Here, we
show that all LN layers can be removed from every GPT-2 model via fine-tuning,
with only a small increase in validation loss (e.g., +0.03 cross-entropy loss for
GPT-2 XL). Thus, LN is not essential at inference time for maintaining comparable
language-modeling performance. We find that the amount of fine-tuning data needed for LN
removal grows sublinearly with the number of model parameters, suggesting that scaling to larger
models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face.
Furthermore, we test interpretability techniques on LN-free models. Direct logit
attribution now gives the exact direct effect of individual components, while the
accuracy of attribution patching does not significantly improve. We also confirm
that GPT-2’s “confidence neurons” are inactive in the LN-free models. Our work
clarifies the role of LN layers in language modeling, showing that GPT-2-class
models can function without them. We hope that our LN-free analogs of the
GPT-2 family of models will enable more precise interpretability research and
improve our understanding of language models.
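To make the idea of an LN-free model concrete, here is a minimal sketch of what "removing" LayerNorm from a Hugging Face GPT-2 checkpoint can look like: each LayerNorm module is replaced by an affine-only map that keeps the learned scale and bias but drops the mean/variance normalization. This is an illustration under assumed names (AffineOnly, strip_layernorm), not the authors' released code, and the substitution only preserves performance for checkpoints fine-tuned to tolerate it, such as the released LN-free models.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class AffineOnly(nn.Module):
    """Affine-only replacement for LayerNorm.

    Keeps the original LN weight/bias but skips the mean/variance
    normalization, so the layer becomes linear in its input.
    """

    def __init__(self, ln: nn.LayerNorm):
        super().__init__()
        self.weight = nn.Parameter(ln.weight.detach().clone())
        self.bias = nn.Parameter(ln.bias.detach().clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.weight + self.bias


def strip_layernorm(model: GPT2LMHeadModel) -> GPT2LMHeadModel:
    """Swap every LayerNorm in a GPT-2 model for an affine-only layer."""
    for block in model.transformer.h:
        block.ln_1 = AffineOnly(block.ln_1)
        block.ln_2 = AffineOnly(block.ln_2)
    model.transformer.ln_f = AffineOnly(model.transformer.ln_f)
    return model


# Illustrative usage on the base GPT-2 checkpoint (an LN-free fine-tuned
# checkpoint would be loaded the same way before or after the swap).
model = strip_layernorm(GPT2LMHeadModel.from_pretrained("gpt2"))
```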
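The claim that direct logit attribution becomes exact also has a simple concrete form: once the final LayerNorm is a fixed affine map, the path from any component's residual-stream write to the logits is linear. The sketch below assumes the affine-only replacement above; the function name and the resid_write argument are hypothetical, standing in for however the component's residual-stream contribution is extracted.

```python
import torch
from transformers import GPT2LMHeadModel


def direct_logit_attribution(resid_write: torch.Tensor,
                             model: GPT2LMHeadModel,
                             token_id: int) -> torch.Tensor:
    """Direct effect of one component's residual-stream write on a logit.

    resid_write: (d_model,) tensor that a single attention head or MLP adds to
    the residual stream at the final position. With the final LayerNorm replaced
    by a plain affine map, the write-to-logit path is linear, so this
    contribution is exact (the shared bias is omitted because it is identical
    for every component).
    """
    scale = model.transformer.ln_f.weight      # affine scale, shape (d_model,)
    unembed = model.lm_head.weight[token_id]   # unembedding row, shape (d_model,)
    return (resid_write * scale) @ unembed
```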
Submission Number: 160