Post-Gating Bias: Restoring Affine Freedom in Transformer MLPs

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: post-gating bias (PGB), gated MLPs, transformer VAEs, training stability
TL;DR: Post-Gating Bias restores affine freedom lost in gated MLPs. By softening activation boundaries under noise from dropout or latent regularization, it sometimes stabilizes training or improves robustness at negligible cost.
Abstract: Modern transformers often omit additive biases because normalization and attention preserve constant offsets that downstream linear maps can remap. The gated MLP (SwiGLU) is a notable exception: the elementwise product with a nonlinear gate destroys such offsets, removing affine freedom after the nonlinearity. We examine a simple modification—Post-Gating Bias (PGB), an additive term applied after the gated product and before the down-projection. PGB restores this degree of freedom with negligible computational cost. Our working hypothesis is that PGB mitigates training noise from dropout or from additive stochastic regularization (e.g., in VAEs) by shifting activation boundaries in a controlled way, thereby softening sharp transitions that otherwise amplify perturbations. We observe stability gains at higher learning rates and some robustness improvements in a ViT-VAE setting. We also show other settings where the effect is minimal. We present these observations to clarify where biases are redundant in transformer blocks and where multiplicative gating makes them potentially useful. Finally, we report a controlled study varying dropout and latent noise across multiple seeds to test this hypothesis.
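The placement described in the abstract (an additive term after the gated product and before the down-projection) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names (`silu`, `matvec`, `gated_mlp`), the `pgb` argument, the tiny dimensions, and the random weights are all assumptions made for illustration. It assumes a bias-free SwiGLU-style MLP whose gate uses SiLU.

```python
import math
import random


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(vec, mat):
    """Row vector times matrix, where mat is a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def gated_mlp(x, w_gate, w_up, w_down, pgb=None):
    """Bias-free gated MLP; `pgb` is a hypothetical post-gating bias vector."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gated product
    if pgb is not None:
        # Additive term after the gated product, before the down-projection.
        hidden = [h + b for h, b in zip(hidden, pgb)]
    return matvec(hidden, w_down)


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden = 4, 6  # illustrative sizes only
w_gate = rand_matrix(rng, d_model, d_hidden)
w_up = rand_matrix(rng, d_model, d_hidden)
w_down = rand_matrix(rng, d_hidden, d_model)
pgb = [rng.gauss(0.0, 1.0) for _ in range(d_hidden)]

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
zero = [0.0] * d_model

# Without the bias, a zero input always maps to a zero output.
out_zero_plain = gated_mlp(zero, w_gate, w_up, w_down)

# With the bias, the block gains a learnable constant offset pgb @ w_down.
out_zero_pgb = gated_mlp(zero, w_gate, w_up, w_down, pgb=pgb)
offset = matvec(pgb, w_down)

# The same offset is added for any input, since the bias sits after the gate.
diff = [a - b for a, b in zip(gated_mlp(x, w_gate, w_up, w_down, pgb=pgb),
                              gated_mlp(x, w_gate, w_up, w_down))]
```

In this sketch the bias-free block is pinned to zero output at zero input, while the variant with `pgb` can shift its output by an input-independent vector. That is one concrete reading of the "affine freedom" the abstract refers to. The extra cost is a single vector addition of width `d_hidden`.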
Primary Area: foundation or frontier models, including LLMs
Submission Number: 24345