Regress, Don’t Guess – A Regression-like Loss on Number Tokens for Language Models

Jonas Zausinger; Lars Pennig; Kacper Chlodny; Vincent Limbach; Anna Ketteler; Thorben Prein; Vishwa Mohan Singh; Michael Danziger; Jannis Born

Published: 10 Oct 2024, Last Modified: 31 Oct 2024. Venue: MATH-AI 24. License: CC BY 4.0

Keywords: Language models, mathematical reasoning, number representation, Number Token Loss, regression loss

TL;DR: Introduces the Number Token Loss (NTL), a regression-like loss on number tokens that improves language models' arithmetic reasoning by considering numerical proximity during training.

Abstract: While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetic. This is particularly relevant for scientific datasets, where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, we present two versions of a number token loss. The first is based on an $L_p$ loss between the ground truth token value and the sum of the number token values weighted by their predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
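To make the contrast with CE concrete, the following is a minimal sketch of the two loss variants described in the abstract, not the authors' implementation. It assumes a hypothetical toy vocabulary consisting only of the ten digit tokens 0 to 9 with unit spacing, a one-hot ground truth, and plain Python lists in place of model outputs. The function names (`ntl_lp`, `ntl_wasserstein1`, `point_masses`) are illustrative and do not come from the paper.

```python
import math

# Assumption: a toy vocabulary of the ten digit tokens with their numeric values.
DIGIT_VALUES = [float(d) for d in range(10)]


def point_masses(masses):
    """Build a probability vector over the digits from {token_index: probability}."""
    probs = [0.0] * len(DIGIT_VALUES)
    for idx, mass in masses.items():
        probs[idx] = mass
    return probs


def cross_entropy(probs, target_idx):
    """Standard CE: depends only on the probability given to the correct token."""
    return -math.log(probs[target_idx])


def ntl_lp(probs, target_idx, p=2, values=DIGIT_VALUES):
    """L_p-style loss between the target value and the probability-weighted
    sum of the token values."""
    expected = sum(q * v for q, v in zip(probs, values))
    return abs(values[target_idx] - expected) ** p


def ntl_wasserstein1(probs, target_idx):
    """Wasserstein-1 distance to a one-hot target. With ordered, unit-spaced
    token values this is the sum of absolute differences between the CDFs."""
    cdf_pred, cdf_true, dist = 0.0, 0.0, 0.0
    for i, q in enumerate(probs):
        cdf_pred += q
        cdf_true += 1.0 if i == target_idx else 0.0
        dist += abs(cdf_pred - cdf_true)
    return dist


if __name__ == "__main__":
    target = 5
    near = point_masses({4: 0.9, 5: 0.1})  # most mass on the neighbouring digit
    far = point_masses({9: 0.9, 5: 0.1})   # most mass on a distant digit
    for name, probs in (("near", near), ("far", far)):
        print(name, cross_entropy(probs, target),
              ntl_lp(probs, target), ntl_wasserstein1(probs, target))
```

In this toy setting both predictions assign the same probability to the correct digit, so CE cannot tell them apart, while both regression-like terms penalize the prediction centred on 9 more heavily than the one centred on 4. How such a term is weighted against CE during training is not specified in the abstract and is left out of the sketch.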

Concurrent Submissions: N/A

Submission Number: 4
