Keywords: Interpretability, Large language models, Information theory
TL;DR: A single entropy value per layer uncovers how frozen transformers compute across models, tasks, and domains.
Abstract: Transformer blocks iteratively refine next-token distributions, yet most interpretability tools analyze hidden states rather than token-space dynamics.
We introduce Entropy-Lens, a model-agnostic method that tracks the entropy of logit-lens predictions across layers, yielding an entropy profile: a per-layer, permutation-invariant scalar summary of token prediction dynamics.
Entropy differences between consecutive layers serve as a proxy for two strategies: expansion (broadening the candidate token set) and pruning (narrowing it).
Across model families and scales, entropy profiles show stable family-specific token prediction dynamics and exhibit depth-rescaling invariance.
Finally, selectively skipping layers associated with maximal expansion or pruning shows that the two strategies have unequal functional importance for downstream multiple-choice accuracy, with expansion typically being more critical.
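The abstract's core quantity, a per-layer entropy of logit-lens next-token distributions, can be sketched in a few lines. Below is a minimal illustration, not the authors' released code: it assumes a Hugging Face causal LM (here `gpt2`, whose final layer norm is `model.transformer.ln_f`; the attribute names vary by model family) and decodes every intermediate hidden state with the final layer norm and unembedding matrix.

```python
# Minimal sketch of an entropy profile via the logit lens.
# Model choice and module names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM exposing hidden states
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

ln_f = model.transformer.ln_f  # GPT-2 naming; differs across families
unembed = model.lm_head

entropies = []
for h in out.hidden_states:            # one tensor per layer (plus embeddings)
    logits = unembed(ln_f(h[:, -1]))   # next-token logits at the last position
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    entropies.append(ent.item())

# Consecutive differences: positive deltas suggest expansion
# (more candidate tokens), negative deltas suggest pruning.
deltas = [b - a for a, b in zip(entropies, entropies[1:])]
print("profile:", [round(e, 2) for e in entropies])
print("deltas: ", [round(d, 2) for d in deltas])
```

The resulting list of per-layer entropies is the "entropy profile" described above, and the sign of each consecutive difference gives the expansion/pruning reading used in the abstract.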
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 114