Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages

TMLR Paper7693 Authors

26 Feb 2026 (modified: 05 Mar 2026) · Under review for TMLR · CC BY 4.0
Abstract: Tokenization- and sub-tokenization-based models such as word2vec, BERT, and GPT-like models are the state of the art in natural language processing. However, these approaches are limited by their input representation: they fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and low-resource languages. Additionally, the sub-tokenization required for these languages increases the length of the input sequence, and the input sequence is even longer for purely byte- or character-based models. To mitigate this problem, we propose a method to compute word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can serve as drop-in replacements for dictionary- and subtoken-based word embeddings in existing model architectures, with the potential to improve performance for both large context-based language models like BERT and small models like word2vec on low-resource and morphologically rich languages. We evaluate our approach on various tasks, including SWAG, declension prediction for inflected languages, and metaphor and chiasmus detection across several languages. Our experiments show that it outperforms traditional token-based approaches on limited data, as measured by Odd-One-Out and TopK metrics as well as application-based downstream tasks.
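The core idea of the abstract — a transformer that maps a word's character string to a single word vector usable as a drop-in embedding — can be illustrated with a minimal sketch. This is not the authors' RCE implementation; the class name, pooling choice (mean over character positions), and all hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Toy character-level word encoder (illustrative, not the paper's RCE):
    embeds a word's characters, runs a transformer encoder over them, and
    mean-pools the outputs into one fixed-size word vector."""
    def __init__(self, vocab_size=128, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, char_ids):              # char_ids: (batch, max_word_len)
        h = self.encoder(self.char_emb(char_ids))
        return h.mean(dim=1)                  # (batch, d_model) word vectors

def word_to_ids(word, max_len=12):
    """Crude codepoint-based character ids, zero-padded to max_len."""
    ids = [ord(c) % 128 for c in word[:max_len]]
    return ids + [0] * (max_len - len(ids))

# Morphological variants of one German verb share most characters, so a
# character-level encoder sees their orthographic relatedness directly.
words = ["laufen", "läuft", "gelaufen"]
batch = torch.tensor([word_to_ids(w) for w in words])
vecs = CharWordEncoder()(batch)               # shape: (3, 64)
```

Because the encoder emits one vector per word, `vecs` could replace the rows of a subword embedding table in a downstream model such as BERT or word2vec, which is the drop-in-replacement property the abstract claims.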
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Adin_Ramirez_Rivera1
Submission Number: 7693