Keywords: large language model, tokenization, time series
TL;DR: We propose a pre-tokenization algorithm and evaluate LLMs on zero-shot time series forecasting.
Abstract: There have been many recent advances in LLMs for zero-shot tasks. While these models have shown great promise, pre-tokenization methods for numbers are mostly empirical and lack experimental justification. In this paper, we analyze the tokenization of numbers through time series forecasting. We conduct experiments to evaluate the impact of different factors on BPE tokenizers, and we propose a novel pre-tokenization algorithm that is justified in maintaining the balance between detail and memory cost. Our analysis highlights the importance of a systematic understanding of the pre-tokenization process and provides a foundation for further exploration.
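For context, one commonly discussed pre-tokenization strategy for numeric text (not necessarily the algorithm proposed in this submission) is to split numbers into individual digits before the BPE tokenizer is applied, so that multi-digit chunks are not merged into arbitrary tokens. The sketch below illustrates this in plain Python; the helper name `digit_split_pretokenize` and the toy series are illustrative assumptions, not part of the paper.

```python
import re

def digit_split_pretokenize(text: str, sep: str = " ") -> str:
    """Insert a separator between consecutive digits so a downstream BPE
    tokenizer treats each digit as its own unit (e.g. "1234" -> "1 2 3 4")."""
    return re.sub(r"(?<=\d)(?=\d)", sep, text)

# A toy time series rendered as comma-separated values.
series = "12.5, 13.0, 13.8, 15.1"
print(digit_split_pretokenize(series))
# -> "1 2.5, 1 3.0, 1 3.8, 1 5.1"
```

Digit-level splitting trades longer token sequences (higher memory cost) for a consistent, fine-grained representation of each number, which is the kind of balance the abstract refers to.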
Submission Number: 231