Beyond Explicit Tokenization: Investigating Transformer Limitations with Subword Granularity

Published: 28 Apr 2026, Last Modified: 28 Apr 2026
MSLD 2026 Poster
License: CC BY 4.0
Keywords: subword tokenization, transformer architecture, latent variable models
TL;DR: We demonstrate that transformers have an inherent difficulty combining representations of smaller units (subwords) into larger units (tokens), and that this difficulty grows as the subwords become finer.
Abstract: In NLP, it is widely assumed that the tokens processed by a language model should carry some semantic meaning, roughly corresponding to morphemes. However, as the field moves steadily toward end-to-end architectures optimized for downstream tasks, extracting explicit linguistic features is becoming obsolete. Rather than attempting to make tokenization more morphological, we believe that explicit tokenization separate from the language model should be eliminated altogether. To demonstrate where transformers struggle with subword-level tokenization, we designed a simple synthetic task. We generate random sequences of $n$-bit words and divide each word into $k$-bit subwords, where $k \leq n$. We then apply a random function $\sigma$ that acts as a word-for-word permutation on the set of all $n$-bit words. We train a transformer-based encoder-decoder model to discover $\sigma$ for $n = 8$ and $k \in \{1, 2, 4, 8\}$. The model consistently converges to zero validation loss for $k = 4$ and $k = 8$, but fails for $k = 1$ and $k = 2$. Some NLP practitioners argue that the only drawback of character- and byte-level models is that they are significantly slower than subword models of similar accuracy. This task, however, reveals a deeper issue: transformers have an inherent difficulty combining representations of smaller units (characters or subwords) into larger units (tokens). To make the transformer capable of solving this task, we added recurrent neural networks (RNNs) before and after it, for downsampling and upsampling respectively. We chose RNNs over other pooling methods because they are autoregressive and therefore naturally order-dependent, which matters here because function composition is not commutative. With RNN-based pooling added, our model consistently converges to zero validation loss for all $k \in \{1, 2, 4, 8\}$. Our current research aims to develop an end-to-end architecture that learns optimal token boundaries by modeling them as latent variables. By combining this approach with RNN-based pooling, we hope to eliminate the downsides of explicit subword tokenization entirely.
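A minimal sketch of the synthetic data generation described above, in plain Python. The sequence length, the MSB-first subword order, and the helper names are illustrative assumptions, not details taken from the paper:

```python
# Sketch of the synthetic task: random sequences of n-bit words, split into
# k-bit subwords, mapped through a random word-level permutation sigma.
import random

n = 8                      # bits per word
k = 2                      # bits per subword; assumed to divide n
seq_len = 16               # words per sequence (assumed value)

# sigma: a random permutation of the set of all n-bit words
words = list(range(2 ** n))
shuffled = words[:]
random.shuffle(shuffled)
sigma = dict(zip(words, shuffled))

def to_subwords(word, n_bits, k_bits):
    """Split an n-bit integer into n/k integers of k bits each (MSB first)."""
    return [(word >> shift) & ((1 << k_bits) - 1)
            for shift in range(n_bits - k_bits, -1, -k_bits)]

def make_example():
    """One training pair: source and target sequences at subword granularity."""
    src_words = [random.randrange(2 ** n) for _ in range(seq_len)]
    tgt_words = [sigma[w] for w in src_words]     # apply the word-level permutation
    src = [s for w in src_words for s in to_subwords(w, n, k)]
    tgt = [s for w in tgt_words for s in to_subwords(w, n, k)]
    return src, tgt

src, tgt = make_example()
print(len(src), src[: n // k], "->", tgt[: n // k])
```

The model only ever sees the $k$-bit subword streams, so recovering $\sigma$ requires it to recombine each group of $n/k$ subwords into the underlying word.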
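The RNN-based pooling idea could look roughly like the following PyTorch sketch. This is not the authors' exact architecture: for brevity it uses an encoder-only transformer instead of the encoder-decoder model described above, and the choice of GRUs, module sizes, and class name are assumptions:

```python
# Sketch: a GRU compresses each group of n/k subword embeddings into one token
# vector before the transformer; another GRU expands each output vector back
# into n/k subword logits.
import torch
import torch.nn as nn

class RNNPooledTransformer(nn.Module):
    def __init__(self, vocab, d_model=128, group=4):
        super().__init__()
        self.group = group                      # subwords per word, i.e. n // k
        self.embed = nn.Embedding(vocab, d_model)
        self.down = nn.GRU(d_model, d_model, batch_first=True)   # subwords -> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.up = nn.GRU(d_model, d_model, batch_first=True)     # token -> subwords
        self.out = nn.Linear(d_model, vocab)

    def forward(self, x):                       # x: (batch, seq_len * group)
        b, t = x.shape
        h = self.embed(x).view(b * t // self.group, self.group, -1)
        _, last = self.down(h)                  # final hidden state summarizes each group
        tokens = last.squeeze(0).view(b, t // self.group, -1)
        tokens = self.encoder(tokens)           # transformer operates at token granularity
        # Repeat each token vector as the input sequence for the upsampling GRU.
        rep = tokens.reshape(b * t // self.group, 1, -1).expand(-1, self.group, -1)
        y, _ = self.up(rep.contiguous())
        return self.out(y).view(b, t, -1)       # per-subword logits

model = RNNPooledTransformer(vocab=2 ** 2, group=8 // 2)     # k = 2, n = 8
logits = model(torch.randint(0, 4, (32, 64)))                # (batch, seq_len * group)
```

Because the GRU reads its group of subwords in order, swapping two subwords changes the pooled token vector, which is the order sensitivity the abstract argues is needed.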
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 172