A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

TMLR Paper9182 Authors

24 May 2026 (modified: 29 May 2026)Under review for TMLREveryoneRevisionsBibTeXCC BY 4.0
Abstract: We apply the Weibull distribution, a two-parameter family from extreme-value theory, as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give $|w| \sim \text{HalfNormal}$, which anchors the Weibull shape parameter at $k \approx 1.20$. This makes $k$ a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer enables diagnostics invisible to aggregate statistics. Applying this framework to 12 model entries spanning 7 architectural families reveals the following findings. First, FFN modules and the attention output projection (the Transmission Class) fall in a narrow $k$ band $[1.186, 1.204]$ across architectures (CV $= 0.51\%$). Second, the attention input projections (the Selection Class, $W_q$ and $W_k$) depart from this band in an architecture-dependent manner: separately-stored MHA shows the largest drift, grouped-query attention shows milder drift, and merged storage shows transitional behavior. The scale parameter $\lambda$ grows during training and tracks $\sqrt{\eta/\lambda_{wd}}$ as a within-family scaling trend in Pythia ($n = 5$ sizes). The two Weibull parameters carry independent information: $k$ labels the functional class, $\lambda$ labels training progress. The framework was further used to diagnose an 11-entry Qwen cohort: shallow-FFN layers exhibit bimodal weight distributions. We release npm-weibull-py v0.4 and DATABASE_v9_1 (anonymized for review; URLs in camera-ready).
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=3XYmdMUCfE
Changes Since Last Submission: The previous submission (8545) was desk-rejected on the grounds that "the argumentation, literature review, and experimental analysis are underdeveloped." This resubmission is a substantial rewrite addressing each criticism: 1. **Repositioned scope.** The previous draft framed the work as a "convergence law"; this version reframes it as a measurement framework with empirical findings (initialization anchor $k_0 \approx 1.20$, noise-optimal middle-80% trim protocol, anti-interference property) followed by cross-family evidence. Title and abstract are rewritten accordingly. 2. **Strengthened argumentation.** Section 1 now articulates problem-motivation-contributions explicitly. Section 2 develops the framework's theoretical basis (Half-Normal initialization, sampling-noise variance argument for the middle-80% trim) before presenting any findings. 3. **Expanded experimental analysis.** The cohort is expanded to 12 entries across 7 architectural families spanning two orders of magnitude in parameter count. The Transmission Class $k$-band claim is supported by CV = 0.51% across 12 entries with $\Gamma$-closure consistency check on 837 FFN fits. A new within-entry diagnostic case study (Appendix C, 11-entry Qwen cohort) demonstrates the framework in actual use. 4. **Literature review.** Section 1 positions the element-wise $|W|$ framework relative to orthogonal spectral approaches (HT-SR, WeightWatcher, AlphaDecay, OrthoAdam). The appendix further situates the framework in relation to activation-side phenomena (super-weight, massive activation, secondary attention sinks in Wong et al. 2025), gradient-side findings (PAGE in Zhang et al. 2026, dead neurons in Voita et al. 2023), and AdamW steady-state scaling work (Fan et al. 2025), and distinguishes the framework's diagnostic contribution from each. 5. **Honest scope claims.** Cross-family $\lambda$ scaling is now framed as a within-family trend (Pythia $n = 5$) rather than a universal law; cross-family confounders (initialization recipe, normalization placement) are explicitly enumerated in Section 7 limitations. 6. **Mechanism positioning.** Selection-class candidate mechanisms (D1-D5, Appendix A.4) are now consistently framed as candidate hypotheses rather than established causes; controlled-ablation verification is deferred to follow-up work.
Assigned Action Editor: ~Vincent_Fortuin1
Submission Number: 9182
Loading