Scaling Behavior for Numeral System: Tokenize Your Numbers into $1$-digit

ACL ARR 2024 June Submission 4853 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Though Large Language Models (LLMs) have shown remarkable abilities in mathematical reasoning, they still struggle to perform numeric operations such as addition and multiplication accurately. Different LLMs tokenize numbers in different ways, and this choice affects numeric operation performance. Currently, there are two representative schemes: 1) tokenize into $1$-digit tokens, and 2) tokenize into $1\sim 3$-digit tokens. The difference is roughly equivalent to using different numeral systems (namely base $10$ or base $10^{3}$). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales and model sizes under from-scratch training settings, while different numeral systems perform very similarly when fine-tuned. Through thorough analysis and experiments, we conclude that tokenizing numbers into $1$-digit tokens is more favorable for LLMs in numerical operations. Additionally, we reveal \textit{extrapolation} behavior patterns on addition and multiplication that shed light on the mechanism learnt by the models.
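To make the distinction concrete, the following is a minimal sketch (not the submission's actual tokenizer) of the two tokenization schemes the abstract contrasts: splitting a number into $1$-digit chunks (base $10$) versus $1\sim 3$-digit chunks (roughly base $10^{3}$). The helper name `tokenize_digits` and the left-to-right chunking convention are illustrative assumptions.

```python
import re


def tokenize_digits(number: str, chunk: int = 1) -> list[str]:
    """Split a number string into digit chunks of at most `chunk` digits, left to right.

    chunk=1 mimics a single-digit (base-10) tokenizer;
    chunk=3 approximates a tokenizer that merges up to three digits per token (base-10^3).
    """
    return re.findall(rf"\d{{1,{chunk}}}", number)


print(tokenize_digits("123456", chunk=1))  # ['1', '2', '3', '4', '5', '6']
print(tokenize_digits("123456", chunk=3))  # ['123', '456']
print(tokenize_digits("12345", chunk=3))   # ['123', '45']
```

Note that real $1\sim 3$-digit tokenizers may group digits right-to-left or rely on learned vocabulary merges, so the grouping above is only one possible convention.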
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Scaling Law, Large Language Models, Numeric Operation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4853