Keywords: Numerical Representations, Number-Line Geometry, Pretraining Data, Integer Frequency, Power-Law Distributions, Numerical Reasoning, Representation Analysis
TL;DR: LLMs’ internal number-line geometry reflects how frequently numbers appear in their pretraining data, with flatter numerical frequency distributions leading to less compressed representations and better number-comparison accuracy.
Abstract: Large language models exhibit compressed, non-uniform internal representations of numerical magnitude, but the pretraining factors associated with this geometry remain unclear. We study whether corpus-level integer statistics are related to the learned number-line geometry of these models. For four documented pretraining corpora, we count integers in $[0, 10{,}000]$ and fit a magnitude-frequency power law, $\mathrm{count}(N) \propto N^{\alpha}$, where more negative $\alpha$ indicates steeper decay and less exposure to large magnitudes. For nine corresponding base models, we extract hidden states for numerical prompts, project them onto a one-dimensional number line with PCA, and estimate a scaling factor $\beta$, where smaller $\beta$ indicates stronger compression. We first show that $\beta$ is behaviorally meaningful: models with less compressed number-line geometry achieve higher likelihood-based number-comparison accuracy. We then find that flatter integer-frequency distributions, corresponding to less negative $\alpha$, are associated with larger $\beta$, providing correlational evidence that pretraining integer statistics are reflected in LLM number representations.
Track: Track 2: ML Research by Muslim Authors
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Non Archival Confirmation: I understand that submissions to MusIML are non-archival and can be submitted to other venues.
Submission Number: 97
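
Illustrative sketch (not from the submission): the abstract describes fitting a magnitude-frequency power law, $\mathrm{count}(N) \propto N^{\alpha}$, to integer counts. One common way to estimate such an exponent is ordinary least squares in log-log space, shown below. The function name `fit_power_law_exponent`, the choice of log-log regression, the exclusion of $N = 0$ and zero-count integers, and the synthetic counts are all assumptions made for illustration; the authors' actual fitting procedure may differ.

```python
import numpy as np


def fit_power_law_exponent(counts):
    """Estimate alpha in count(N) ~ C * N**alpha by log-log least squares.

    `counts[N]` is assumed to hold the corpus frequency of the integer N.
    N = 0 and integers with zero count are skipped (an assumption of this
    sketch), because their logarithms are undefined.
    """
    counts = np.asarray(counts, dtype=float)
    magnitudes = np.arange(len(counts))
    usable = (magnitudes > 0) & (counts > 0)
    # The slope of log(count) against log(N) is the exponent alpha.
    slope, _intercept = np.polyfit(
        np.log(magnitudes[usable]), np.log(counts[usable]), 1
    )
    return slope


# Synthetic counts (not real corpus data) generated with a known exponent.
n = np.arange(0, 10001)
synthetic = np.zeros(len(n), dtype=float)
synthetic[1:] = 1e6 * n[1:] ** -1.2

alpha = fit_power_law_exponent(synthetic)
print(f"recovered alpha: {alpha:.3f}")
```

On real counts, a more negative recovered exponent would correspond to the steeper decay the abstract describes. Log-log least squares is known to be sensitive to noise in the sparse tail, so binning or a maximum-likelihood estimator may be preferable in practice.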