Pre-Tokenization of Numbers for Large Language Models

Zhenglong Wu; Qi Qi; Zirui Zhuang; Haifeng Sun; Jingyu Wang

Published: 19 Mar 2024, Last Modified: 01 Jun 2024, Tiny Papers @ ICLR 2024 Archive, CC BY 4.0

Keywords: large language model, tokenization, time series

TL;DR: We propose a pre-tokenization algorithm for numbers and test LLMs on zero-shot time series forecasting.

Abstract: There have been many recent advances in large language models (LLMs) for zero-shot tasks. While these models have shown great promise, the pre-tokenization methods used for numbers are mostly empirical and lack experimental justification. In this paper, we analyze the tokenization of numbers through time series forecasting. We conduct experiments to evaluate the impact of different factors on BPE tokenizers, and propose a novel pre-tokenization algorithm that we justify as maintaining a balance between detail and memory cost. Our analysis highlights the importance of a systematic understanding of the pre-tokenization process and provides a foundation for further exploration.
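To make the notion of number pre-tokenization concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the paper, which the abstract does not specify. It assumes one common family of rules: before BPE is applied, each run of ASCII digits is split into chunks of at most `max_digits` digits. The function name `pre_tokenize_numbers` and the parameter `max_digits` are hypothetical, chosen only for this example.

```python
import re

# Hypothetical pattern: a run of ASCII digits, or a run of anything else.
_RUNS = re.compile(r"(?P<num>[0-9]+)|(?P<other>[^0-9]+)")


def pre_tokenize_numbers(text: str, max_digits: int = 1) -> list[str]:
    """Split text into pieces, cutting each digit run into chunks.

    Illustrative assumption only: digit runs are cut left to right into
    chunks of at most `max_digits` digits; all other text is passed
    through unchanged as a single piece per run.
    """
    if max_digits < 1:
        raise ValueError("max_digits must be at least 1")
    pieces: list[str] = []
    for match in _RUNS.finditer(text):
        run = match.group()
        if match.lastgroup == "num":
            # Fixed-width, left-to-right chunking of the digit run.
            pieces.extend(
                run[i:i + max_digits] for i in range(0, len(run), max_digits)
            )
        else:
            pieces.append(run)
    return pieces


# A toy serialized time series (made-up values, not data from the paper).
series = "123, 4567"
for width in (1, 3):
    pieces = pre_tokenize_numbers(series, max_digits=width)
    print(width, len(pieces), pieces)
```

Under this assumed rule, a smaller chunk width keeps every digit as its own piece at the cost of a longer sequence, while a larger width shortens the sequence but merges digits together. That is one plausible reading of the trade-off between detail and memory cost mentioned in the abstract.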

Submission Number: 231
