Numerical Fragility in Transformers: A Layer-wise Theory for Risk Estimation and Selective Stabilization
TL;DR: We develop a layer-wise forward-error theory for low-precision Transformers that explains, predicts, and mitigates numerical instability.
Abstract: Low-precision execution can induce substantial forward discrepancies in Transformers even for fixed weights and inputs, yet these discrepancies are usually monitored only at the output and lack a layer-wise theoretical account. We develop a first-order decomposition of output mismatch into layer-local attention, LayerNorm, and residual-transport terms, and derive from it a practical causal risk estimator and a budgeted controller, Bound-Guided Selective Stabilization (BGSS). Controlled sweeps verify the predicted local sign, monotonicity, and transport structure. On GPT-2, the transport-aware combined predictor is positively correlated with FP32-reference mismatch in all 18 runs and improves over a no-transport ablation in 17/18 runs. Reference-patch attribution shows that the same score preserves useful layer-ordering information (mean Spearman 0.362). In budget-matched mitigation, BGSS outperforms a random same-budget control in onset events (10.67 vs. 11.67), final mismatch (0.001243 vs. 0.001284), and worst-case mismatch (0.00314 vs. 0.00849), while matching a risk-only same-budget controller on onset suppression and sharply reducing worst-case mismatch (0.00314 vs. 0.00571). These results support a theory-to-algorithm account of Transformer numerical fragility in which finite-precision risk can be analyzed, estimated, localized, and selectively stabilized.
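The budgeted-controller idea in the abstract (rank layers by an estimated numerical risk, then stabilize only the top-scoring layers under a fixed budget) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the toy risk scores, and the greedy top-k selection rule are all assumptions made for the example.

```python
import numpy as np

def select_layers_to_stabilize(risk_scores, budget):
    """Greedy budgeted selection: pick the `budget` layers with the
    highest estimated numerical risk for stabilization (e.g., running
    them in FP32 while the rest stay low-precision).

    Hypothetical helper illustrating the controller concept; not the
    paper's BGSS API."""
    order = np.argsort(np.asarray(risk_scores))[::-1]  # highest risk first
    return sorted(order[:budget].tolist())

# Toy per-layer risk estimates for a 6-layer model (made-up numbers).
scores = [0.02, 0.31, 0.05, 0.27, 0.01, 0.40]
print(select_layers_to_stabilize(scores, budget=2))  # -> [1, 5]
```

In the paper's setting the risk scores would come from the transport-aware combined predictor derived from the layer-wise bound, rather than from arbitrary constants as here.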
Code Dataset Promise: Yes
Code Dataset Url: https://github.com/JinwooBaek00/Numerical-Fragility-in-Transformers
Signed Copyright Form: pdf
Format Confirmation: I agree that I have read and followed the formatting instructions for the camera ready version.
Submission Number: 2273