Tokenization as Cultural Erasure: How Corpus Composition Shapes the Representation of Aymara Morphology in NLP Systems
Keywords: Low-Resource NLP, Morphological Tokenization, Indigenous Language Processing, Agglutinative Languages, Multilingual Machine Translation
TL;DR: Training tokenizers on morphologically simple Aymara forms improves translation quality and preserves compositional linguistic structure better than standard full-corpus tokenization.
Abstract: Tokenization is not a neutral preprocessing step for agglutinative languages whose morphology encodes culturally meaningful distinctions. In Aymara, evidentiality, temporal orientation, and relational meaning are expressed through productive morpheme combinations that may become obscured when tokenizers are trained primarily on frequent surface forms. We present a controlled study of five SentencePiece Unigram tokenizers trained on linguistically stratified Spanish--Aymara corpora containing 17,856 translation pairs. Across 15 training runs with identical downstream T5 architectures, the tokenizer trained exclusively on morphologically simple forms achieves the strongest performance at every evaluation level, reaching $17.01 \pm 0.23$ chrF globally and $17.73 \pm 0.40$ chrF on compositional structures despite having the highest fertility and smallest vocabulary. We further show that a commonly used morpheme integrity metric may systematically favor boundary fusion in agglutinative settings, assigning the best-performing tokenizer the lowest score because correct segmentation reduces surface-form preservation. Based on these findings, we propose the Morphological Boundary Hypothesis: tokenizers trained on morphologically simple forms learn reusable roots and suffixes as independent vocabulary units, enabling better compositional generalization downstream. Our results suggest that tokenizer corpus composition substantially influences morphological representation quality in low-resource agglutinative language systems and that morphologically grounded tokenization can improve translation performance with minimal additional computational cost.
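The abstract contrasts fertility with how well segmentation respects morpheme boundaries. The following is a minimal sketch of both ideas, not the paper's evaluation code: `fertility` is the usual tokens-per-word average, while `boundary_recall` is a hypothetical stand-in measure introduced here for illustration (it is not the morpheme integrity metric the abstract critiques). The segmentations are toy examples, not data from the study.

```python
from typing import Dict, List, Set


def fertility(segmentations: Dict[str, List[str]]) -> float:
    """Average number of subword tokens produced per word."""
    total_tokens = sum(len(tokens) for tokens in segmentations.values())
    return total_tokens / len(segmentations)


def internal_boundaries(pieces: List[str]) -> Set[int]:
    """Character offsets at which one piece ends and the next begins."""
    offsets: Set[int] = set()
    position = 0
    for piece in pieces[:-1]:
        position += len(piece)
        offsets.add(position)
    return offsets


def boundary_recall(tokens: List[str], morphemes: List[str]) -> float:
    """Fraction of gold morpheme boundaries that are also token boundaries.

    Hypothetical illustration measure, assumed for this sketch only.
    """
    gold = internal_boundaries(morphemes)
    if not gold:
        return 1.0  # a single-morpheme word has no boundary to miss
    return len(gold & internal_boundaries(tokens)) / len(gold)


# Toy segmentations (illustrative only): one tokenizer fuses root and
# suffix into a single token, the other splits them apart.
fused = {"utanaka": ["utanaka"], "utaja": ["utaja"]}
split = {"utanaka": ["uta", "naka"], "utaja": ["uta", "ja"]}

print(fertility(fused), fertility(split))
print(boundary_recall(fused["utanaka"], ["uta", "naka"]))
print(boundary_recall(split["utanaka"], ["uta", "naka"]))
```

In this toy setting the splitting tokenizer has higher fertility yet recovers every morpheme boundary, which mirrors the tension the abstract describes between surface-form preservation and correct segmentation.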
Submission Number: 60