Keywords: pixel-based text representations, machine translation, multilinguality, cross-lingual transfer, unseen scripts
TL;DR: We augment pretrained language models with pixel-based text representations, overcoming vocabulary constraints and improving multilingual and cross-script performance.
Abstract: Subword tokenization requires balancing computational efficiency and vocabulary coverage, often leading to suboptimal performance on languages and scripts not prioritized during training.
We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered to pixels.
Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods.
Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion.
Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1403