TL;DR: The Number Token Loss (NTL) is a regression-like loss on number tokens that augments cross-entropy to improve language models on numerical tasks.
Abstract: While language models have exceptional capabilities in text generation, they lack a natural inductive bias for emitting numbers and thus struggle with tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the Cross Entropy loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we present a regression-like loss that operates purely on the token level. Our proposed **Number Token Loss** (NTL) comes in two flavors and minimizes either the $\mathcal{L}_p$ norm or the Wasserstein distance between the *numerical values* of the real and predicted number tokens. NTL can easily be added to any language model and extends the Cross Entropy objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance on math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head despite operating on the token level. Finally, we scale NTL up to 3B-parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope that this work can inspire LLM developers to improve their pretraining objectives.
The code is available via: https://tum-ai.github.io/number-token-loss/
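To make the idea concrete, below is a minimal PyTorch-style sketch of how the Wasserstein flavor of NTL could be combined with cross entropy. This is an illustrative reconstruction from the description above, not the reference implementation (see the linked repository for that); the function name `ce_with_ntl_was`, the tensor layout, and the default `ntl_weight` are assumptions.

```python
import torch
import torch.nn.functional as F


def ce_with_ntl_was(logits, labels, number_token_ids, number_token_values, ntl_weight=0.3):
    """Cross entropy plus a Wasserstein-style Number Token Loss (sketch).

    logits:              (batch, seq_len, vocab_size) raw model outputs
    labels:              (batch, seq_len) target token ids
    number_token_ids:    (n_num,) vocabulary ids of the number tokens (e.g. "0".."9")
    number_token_values: (n_num,) numeric value of each number token, as floats
    ntl_weight:          weight of the NTL term (hyperparameter, illustrative default)
    """
    # Standard token-level cross entropy over the full vocabulary.
    ce = F.cross_entropy(logits.transpose(1, 2), labels)

    # Only positions whose ground-truth token is a number token contribute to NTL.
    is_number = torch.isin(labels, number_token_ids)
    if not is_number.any():
        return ce

    # Predicted distribution restricted (and renormalized) to the number tokens.
    num_probs = logits[is_number][:, number_token_ids].softmax(dim=-1)   # (n_pos, n_num)

    # Numeric value of each ground-truth number token.
    match = (labels[is_number].unsqueeze(1) == number_token_ids.unsqueeze(0)).long()
    true_values = number_token_values[match.argmax(dim=1)]               # (n_pos,)

    # Wasserstein-1 distance to a point mass at the true value:
    # the expected absolute difference between predicted and true numeric values.
    distances = (number_token_values.unsqueeze(0) - true_values.unsqueeze(1)).abs()
    ntl = (num_probs * distances).sum(dim=-1).mean()

    return ce + ntl_weight * ntl
```

Because the penalty in this sketch scales with the numerical distance between predicted and true values, putting probability mass on "9" when the target is "2" costs more than mass on "3", which is exactly the proximity signal that plain cross entropy cannot provide.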
Lay Summary: Large language models are great at writing documents and answering questions, but when it comes to math, they often make mistakes. A key reason is that these models have no built-in understanding of how numbers relate to one another. For example, they treat the numbers “2” and “3” as just different words, not as digits that are close together.
To address this, we developed a new way to train language models by giving them additional feedback on numbers. Our method, called Number Token Loss (NTL), explicitly teaches models that “2” and “3” are numerically close, while “2” and “9” are farther apart. It measures how much the model’s predicted number probabilities would need to shift to match the correct value, i.e., the numerical distance between the prediction and the true value.
We tested this on math problems and found that it consistently improves performance. Importantly, our method can be used with any language model, is fast to compute, and is easy to integrate.
Link To Code: https://tum-ai.github.io/number-token-loss/
Primary Area: Deep Learning->Large Language Models
Keywords: language models, mathematical reasoning, arithmetic, number representation, number token loss
Submission Number: 8153