Overcoming Vocabulary Constraints with Pixel-level Fallback

Published: 08 Jul 2025, Last Modified: 26 Aug 2025COLM 2025EveryoneRevisionsBibTeXCC BY 4.0
Keywords: pixel-based text representations, machine translation, multilinguality, cross-lingual transfer, unseen scripts
TL;DR: We augment pretrained language models with pixel-based text representations, overcoming vocabulary constraints, improving multilingual and cross-script performance
Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, often leading to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered to pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1403
Loading