Number Value Loss in LLMs and N-adic Tokenization

03 May 2026 (modified: 10 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: This paper provides a theoretical analysis of metrics for numerical value comparison and reasoning in machine learning. We address the root causes of optimization failure in numerical representations through three primary contributions. First, we introduce MALL (Magnitude-Aware Log Loss), a metric designed to ensure gradient stability and sensitivity across more than 30 orders of magnitude. We demonstrate that MALL maintains a robust signal for both global magnitude and local precision across the entire $\mathbb{R}^{+}$ domain, resolving the vanishing- and exploding-gradient problems inherent in traditional metrics. This provides a stable foundation for numerical reasoning over bijective number-value tokenizations, decoded token sequences, and regressions, making MALL both a superior drop-in replacement and a standalone baseline for numerical comparison. Second, we identify the Softmax boundary problem --- a fundamental structural failure at digit-order transitions caused by the interplay between independent positional distributions and positional tokenization. We establish a No-Go theorem proving that additive per-token continuous losses are mathematically incompatible with numerical stability over large ranges. Consequently, we demonstrate that structured discontinuities in the gradient field act as a necessary catalyst for global consistency, and we propose a deferred global loss with hardmax as a regularization strategy to stabilize this behavior. Third, we propose a geometric embedding regularizer, Triangle Loss, based on the triangle inequality, which enforces numerical continuity within the embedding manifold. By ensuring that the geometric relationships between embeddings reflect their numerical distances, Triangle Loss improves generalization for rare tokens in any bijective numerical tokenization and provides a structural basis for learning numerical proximity at extreme scales.
Through mathematical proofs and gradient field visualizations, we demonstrate that our framework addresses the fundamental limitations of current numerical objectives, providing a robust foundation for coherent numerical intelligence in neural architectures.
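The abstract does not give MALL's formula, but its stated properties (scale invariance and bounded gradients across ~30 orders of magnitude on $\mathbb{R}^{+}$) are consistent with a squared error computed in log space. The sketch below is an illustrative assumption, not the paper's definition; the function name `mall` and the `eps` stabilizer are hypothetical.

```python
import math

def mall(pred: float, target: float, eps: float = 1e-30) -> float:
    """Hypothetical magnitude-aware log loss: squared error in log space.

    Because the error is measured between logarithms, a 10x relative error
    contributes the same loss at 1e-15 as at 1e+15, so the gradient signal
    neither vanishes for tiny values nor explodes for huge ones.
    `eps` keeps the logarithm finite near zero (an assumed stabilizer).
    """
    return (math.log(pred + eps) - math.log(target + eps)) ** 2

# A 10x overestimate yields the same loss regardless of absolute scale:
# mall(10.0, 1.0) == mall(1e12, 1e11) == (ln 10)^2
```

Under this reading, "local precision" comes from the loss being zero only at exact agreement, while "global magnitude" sensitivity comes from the log transform compressing the $\mathbb{R}^{+}$ domain into a range where squared error is well-conditioned.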
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=iSk4qOqsfW
Changes Since Last Submission: This is a resubmission following a 'wrong format' desk rejection. The scientific content is unchanged. We have fixed the page size to Letter (removed a4paper).
Assigned Action Editor: ~Nicolas_A._Gontier1
Submission Number: 8744