Abstract: Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains underexplored. We introduce PIXEL-M4, a model pretrained on four visually and linguistically diverse languages: English, Hindi, Ukrainian, and Simplified Chinese. Multilingual evaluations on semantic and syntactic tasks show that PIXEL-M4 outperforms an English-only counterpart on non-Latin scripts. Word-level probing analyses confirm that PIXEL-M4 captures rich linguistic features, even in languages not seen during pretraining. Furthermore, an analysis of its hidden representations shows that multilingual pretraining yields a semantic embedding space closely aligned across the languages used for pretraining. This work demonstrates that multilingual pretraining substantially enhances the capability of pixel language models to effectively support a diverse set of languages.
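As a rough illustration of the input format described in the abstract, the sketch below renders a string to a fixed-size grayscale image and slices it into square patches, the kind of pixel sequence a model like PIXEL-M4 consumes instead of subword token IDs. The patch size, canvas dimensions, and use of PIL's default font are illustrative assumptions and do not reflect the actual PIXEL-M4 rendering pipeline.

```python
# Minimal sketch (not the authors' pipeline): render text to a grayscale image
# and cut it into fixed-size patches for a ViT-style pixel language model.
# PATCH, HEIGHT, WIDTH, and the font are assumptions made for illustration.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

PATCH = 16           # assumed square patch size in pixels
HEIGHT = 16          # assumed render height (a single patch row)
WIDTH = PATCH * 32   # assumed fixed canvas width (32 patches)

def render_to_patches(text: str) -> np.ndarray:
    """Render `text` onto a white canvas and return an array of shape (num_patches, PATCH, PATCH)."""
    canvas = Image.new("L", (WIDTH, HEIGHT), color=255)
    draw = ImageDraw.Draw(canvas)
    draw.text((0, 0), text, fill=0, font=ImageFont.load_default())
    pixels = np.asarray(canvas, dtype=np.float32) / 255.0
    # Split the single row of pixels into non-overlapping square patches.
    patches = pixels.reshape(HEIGHT, WIDTH // PATCH, PATCH).transpose(1, 0, 2)
    return patches

patches = render_to_patches("Pixel language models read rendered text.")
print(patches.shape)  # (32, 16, 16): the patch sequence fed to the encoder
```

Because the model only ever sees pixels, any script the renderer can draw is representable without enlarging a vocabulary, which is what makes the cross-lingual setting studied here possible.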
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, cross-lingual transfer, multilingual representations, multilingual pre-training, multilingual evaluation, less-resourced languages
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English, French, German, Dutch, Estonian, Finnish, Hungarian, Polish, Serbian, Swedish, Turkish, Uzbek, Vietnamese, Hindi, Bengali, Tamil, Telugu, Standard Tibetan, Arabic, Egyptian Arabic, Urdu, Uyghur, Bulgarian, Kyrgyz, Macedonian, Ukrainian, Russian, Chinese, Korean, Japanese, Armenian, Coptic, Modern Greek, Hebrew
Submission Number: 4512