Latent Segment Language Models

ACL ARR 2024 June Submission 4653 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Tokenization is a critical step in every NLP system, yet most works treat it as an isolated component separate from the models being built on top of it. In this paper, we present a framework to jointly learn next-token prediction and segmentation from a sequence of characters or bytes. We evaluate our model on language modeling benchmarks in English, Chinese, and Japanese using both character and byte vocabularies. Our model consistently outperforms baselines on Chinese benchmarks with a character vocabulary and shows significant improvements with a byte vocabulary. Further latency improvements are achieved by adopting different pooling strategies while maintaining results comparable to the best models. Our main contributions are threefold: we propose a language model that learns to segment the input sequence, conforming to the desired segmentation prior; we demonstrate that our model achieves lower token-generation latency than baselines; and we show that our model can be applied to three different languages—English, Chinese, and Japanese—demonstrating its potential for wider NLP applications. Our source code will be released on GitHub.
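To make the core idea of the abstract concrete, the sketch below is a minimal, hypothetical illustration (not the authors' released code, which is not available here) of jointly modeling next-token prediction and a latent segmentation over characters or bytes. All module names, the GRU encoders, and the cumulative-average pooling are illustrative assumptions standing in for whatever architecture and pooling strategies the paper actually uses.

```python
# Hypothetical sketch: a character/byte LM with a latent boundary predictor
# whose soft segment boundaries drive a simple differentiable pooling step.
import torch
import torch.nn as nn

class LatentSegmentLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Character/byte-level contextual encoder (a GRU stands in for
        # whatever encoder the paper actually uses).
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Predicts a soft segment-boundary probability at each position.
        self.boundary = nn.Linear(d_model, 1)
        # Segment-level model over pooled representations.
        self.segment_rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) of character or byte ids
        h, _ = self.encoder(self.embed(x))         # (B, T, D)
        p_bound = torch.sigmoid(self.boundary(h))  # (B, T, 1)
        # One simple differentiable pooling strategy: mix each position's
        # state with a running average, weighted by boundary probability.
        running_avg = h.cumsum(1) / torch.arange(
            1, h.size(1) + 1, device=h.device).view(1, -1, 1)
        pooled = p_bound * h + (1 - p_bound) * running_avg
        s, _ = self.segment_rnn(pooled)
        logits = self.lm_head(s)                   # next-token logits
        return logits, p_bound.squeeze(-1)

model = LatentSegmentLM()
tokens = torch.randint(0, 256, (2, 16))
logits, boundaries = model(tokens)
print(logits.shape, boundaries.shape)  # (2, 16, 256) and (2, 16)
```

In such a setup, the boundary probabilities can be pushed toward a desired segmentation prior with an auxiliary loss, while the language-modeling loss is the usual cross-entropy over next characters or bytes; the choice of pooling affects how many positions the segment-level model must process, which is where latency savings would come from.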
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, dynamic segmentation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English, Chinese, Japanese
Submission Number: 4653