How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

ICML 2025 Workshop TokShop Submission 34 Authors

Published: 10 Jun 2025, Last Modified: 11 Jun 2025
License: CC BY 4.0
Archiving Submission: No (non-archival)
Keywords: Tokenization; Language Model; Probing; Data-Efficient Fine-tuning
Abstract: Tokens, produced by tokenization, are the fundamental units through which language models (LMs) process input. Tokenization can split a word into multiple subwords, a process that differs markedly from how humans perceive words, particularly with respect to phonology. In this work, we examine two types of phonological features: local phonological coherence and prosodic structure. Using probing techniques, we demonstrate that tokenization impairs LMs' ability to capture phonological features. Furthermore, we show that tokenization affects LMs' results during inference, one of their primary applications. Finally, we propose a data-efficient fine-tuning approach for large language models (LLMs) that leverages their pre-trained pronunciation knowledge, significantly enhancing inference performance on phonology-related tasks while preserving the model's performance on other tasks.
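The abstract names two techniques without showing them: subword tokenization that ignores phonological structure, and probing hidden states for phonological features. The sketch below is not the paper's code; it is a minimal illustration of both ideas under stated assumptions. The model choice (`gpt2`), the toy syllable-count labels, and mean-pooling over subword hidden states are all illustrative assumptions, not the authors' setup.

```python
# Minimal sketch (assumptions: gpt2, toy syllable-count labels, mean-pooling).
# (1) shows subword splits that ignore phonological units;
# (2) fits a simple linear probe on hidden states for a phonological label.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# (1) Subword splits rarely align with syllable boundaries,
# e.g. "tokenization" may split as ['token', 'ization'],
# not along syllables like to-ken-i-za-tion.
for word in ["tokenization", "phonology", "coherence"]:
    print(word, "->", tokenizer.tokenize(word))

# (2) Linear probe: predict a phonological label (here, assumed
# syllable counts) from the mean of each word's subword hidden states.
words = ["cat", "paper", "banana", "dog", "table", "elephant"]
labels = [1, 2, 3, 1, 2, 3]  # toy syllable counts for illustration

feats = []
with torch.no_grad():
    for w in words:
        ids = tokenizer(w, return_tensors="pt")
        h = model(**ids).last_hidden_state  # shape: (1, n_subwords, d)
        feats.append(h.mean(dim=1).squeeze(0).numpy())

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print("probe train accuracy:", probe.score(feats, labels))
```

A real probing study would use held-out words, gold phonological annotations, and per-layer representations; this sketch only shows the mechanical shape of the method.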
Submission Number: 34