Pretraining Numerical Frequency and Number-Line in Language Models

Mohammed Ibrahim Awad; Ahmed Elshehaby; Hilal AlQuabeh; Velibor Bojkovic

Published: 14 Jun 2026, Last Modified: 21 Jun 2026

Venue: ICML 2026 Workshop MusIML Poster

License: CC BY 4.0

Keywords: Numerical Representations, Number-Line Geometry, Pretraining Data, Integer Frequency, Power-Law Distributions, Numerical Reasoning, Representation Analysis

TL;DR: LLMs’ internal number-line geometry reflects how frequently numbers appear in their pretraining data, with flatter numerical frequency distributions leading to less compressed representations and better number-comparison accuracy.

Abstract: Large language models exhibit compressed, non-uniform internal representations of numerical magnitude, but the pretraining factors associated with this geometry remain unclear. We study whether corpus-level integer statistics are related to the learned number-line geometry of these models. For four documented pretraining corpora, we count integers in $[0:10{,}000]$ and fit a magnitude-frequency power law, $\mathrm{count}(N) \propto N^{\alpha}$, where more negative $\alpha$ indicates steeper decay and less exposure to large magnitudes. For nine corresponding base models, we extract hidden states for numerical prompts, project them onto a one-dimensional number line with PCA, and estimate a scaling factor $\beta$, where smaller $\beta$ indicates stronger compression. We first show that $\beta$ is behaviorally meaningful: models with less compressed number-line geometry achieve higher likelihood-based number-comparison accuracy. We then find that flatter integer-frequency distributions, corresponding to less negative $\alpha$, are associated with larger $\beta$, providing correlational evidence that pretraining integer statistics are reflected in LLM number representations.
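The two measurement steps named in the abstract, fitting $\mathrm{count}(N) \propto N^{\alpha}$ and projecting hidden states onto one dimension with PCA, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the log-log least-squares fit, the exclusion of $N = 0$ and zero counts, and the synthetic data are all assumptions made for illustration. The estimation of $\beta$ is omitted because the abstract does not give its definition.

```python
import numpy as np


def fit_power_law_exponent(counts):
    """Estimate alpha in count(N) ~ N**alpha; counts[N] is the count of integer N.

    Assumption: ordinary least squares on log(count) versus log(N).
    """
    counts = np.asarray(counts, dtype=float)
    n = np.arange(len(counts), dtype=float)
    # log is undefined at N = 0 and for zero counts, so those entries are skipped.
    mask = (n >= 1) & (counts > 0)
    slope, _intercept = np.polyfit(np.log(n[mask]), np.log(counts[mask]), deg=1)
    return slope


def project_to_number_line(hidden_states):
    """Project (num_prompts, hidden_dim) states onto their first principal component."""
    x = np.asarray(hidden_states, dtype=float)
    x = x - x.mean(axis=0)  # PCA requires centered data
    _u, _s, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[0]  # 1-D coordinate per prompt; the sign is arbitrary


# Synthetic stand-ins (hypothetical): exact power-law counts with alpha = -1.5.
n = np.arange(0, 10_001, dtype=float)
counts = np.zeros_like(n)
counts[1:] = 1e9 * n[1:] ** -1.5
alpha_hat = fit_power_law_exponent(counts)

# Synthetic "hidden states" that vary along one direction with log-magnitude.
rng = np.random.default_rng(0)
magnitudes = np.log1p(n[:500])
direction = rng.normal(size=64)
states = np.outer(magnitudes, direction) + 0.01 * rng.normal(size=(500, 64))
coords = project_to_number_line(states)
```

On real data, `counts` would come from a corpus scan and `states` from a model's hidden layer for numerical prompts. A more negative `alpha_hat` corresponds to the steeper decay the abstract describes.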

Track: Track 2: ML Research by Muslim Authors

Submission Number: 97
