Spectral Scaling Laws in Language Models: {\em How Effectively Do Feed-Forward Networks Use Their Latent Space?}

ACL ARR 2025 May Submission7997 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: As Large Language Models (LLMs) continue to scale, understanding how effectively their internal capacity is utilized becomes increasingly important, especially for inference-time efficiency. While existing scaling laws relate model size to loss and compute, they offer little insight into the representational dynamics of individual components. In this work, we focus on the Feed-Forward Network (FFN), a dominant sub-block in decoder-only transformers, and recast FFN width selection as a {\em spectral utilization} problem. We introduce a lightweight, differentiable diagnostic suite comprising \textbf{Hard Rank} (participation ratio), \textbf{Soft Rank} (spectral entropy), \textbf{Spectral Concentration}, and the composite \textbf{Spectral Utilization Index} (SUI), designed to quantify how many latent directions are meaningfully activated. Our spectral audit across GPT-2, LLaMA, and nGPT models reveals that spectral utilization grows with model size but not monotonically with width, often peaking at intermediate dimensions (e.g., $D=2048$). We identify clear instances of \emph{spectral collapse}, where wider FFNs concentrate variance into a narrow subspace, leaving much of the latent space unused.
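For concreteness, the first two diagnostics admit standard definitions; the abstract does not spell out the exact formulas or the SUI weighting, so the following assumes the conventional participation-ratio and spectral-entropy forms. Given the singular values $\sigma_i$ of a centered FFN activation matrix and the normalized spectrum $p_i = \sigma_i^2 / \sum_j \sigma_j^2$,
$$
\mathrm{HardRank} \;=\; \frac{\bigl(\sum_i \sigma_i^2\bigr)^2}{\sum_i \sigma_i^4} \;=\; \frac{1}{\sum_i p_i^2},
\qquad
\mathrm{SoftRank} \;=\; \exp\!\Bigl(-\sum_i p_i \log p_i\Bigr).
$$
Under these forms, both quantities range from $1$ (all variance concentrated in a single direction) to the full width $D$ (a flat spectrum), so each can be read directly as an effective count of utilized latent directions.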
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: spectral utilization
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7997