Tokenization on the Number Line is All You Need

Anonymous

16 Nov 2021 (modified: 05 May 2023) | ACL ARR 2021 November Blind Submission | Readers: Everyone
Abstract: Despite recent breakthroughs in language modeling, the ability of language models to represent numbers remains insufficient. Subword tokenization, the standard choice for number representation, breaks a number into arbitrary chunks and thereby fails to explicitly capture the relationship between two numbers on the number line. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods can be broadly classified into three categories that make changes to a) the notation (e.g., scientific vs. decimal), b) the vocabulary (e.g., introducing a new token for numbers in the range $10-100$), and c) the architecture, to directly regress to a desired number. The contributions of this work are threefold. First, we propose vocabulary-level changes at the decoding stage and study their behavior. Next, we study the performance of both the proposed approach and existing number representation schemes on masked number prediction. We find that a carefully designed tokenization scheme is both the simplest to implement and sufficient, i.e., it performs on par with the state-of-the-art approach that requires significant architectural changes. Finally, we evaluate the various number representation schemes on the downstream task of numerical fact estimation (for Fermi problems) in a zero-shot setting and find similar trends, i.e., changes at the tokenization level achieve near state-of-the-art results while requiring minimal resources compared to other number representation schemes.
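As a rough illustration of what a vocabulary-level change could look like, the sketch below maps each number to a single order-of-magnitude token, so that, for instance, all values in $10-100$ share one vocabulary entry. The token names and binning scheme here are hypothetical and are not taken from the paper; they only show the general idea of replacing arbitrary subword chunks with tokens tied to position on the number line.

```python
import math

def magnitude_token(value: float) -> str:
    """Map a number to a single order-of-magnitude vocabulary token.

    Hypothetical scheme: every value with the same sign and the same
    power-of-ten magnitude collapses onto one shared token.
    """
    if value == 0:
        return "[NUM_0]"
    sign = "NEG_" if value < 0 else ""
    exponent = math.floor(math.log10(abs(value)))
    return f"[NUM_{sign}1e{exponent}]"

print(magnitude_token(42))     # [NUM_1e1]  -> shared by all values in [10, 100)
print(magnitude_token(1500))   # [NUM_1e3]
print(magnitude_token(-0.05))  # [NUM_NEG_1e-2]
```

Under such a scheme, numbers that are close on the number line tend to receive the same (or adjacent) tokens, which is the property that arbitrary subword chunking fails to provide.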