Vision Language Meets Transformers

ACL ARR 2026 January Submission 6681 Authors

05 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Vision Language, BERT, Transformer, BPE, ImageNet, Token
Abstract: Recent vision transformers model images as token sequences but rely on patch-based or continuous pixel embeddings. We extend the vision-language paradigm by representing images as textual sequences derived directly from pixels and modeling them with transformer-based language models. Pixel intensities are mapped to Unicode characters and tokenized using Byte Pair Encoding (BPE), after which a BERT-style transformer is trained on the resulting sequences. As a proof of concept, we pretrain on ImageNet-1K and fine-tune on MNIST and Fashion-MNIST. Results show that vision-language-based transformer models can effectively operate on pixel-derived text and benefit from scalable vocabularies, framing images as a discrete and extensible language.
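The abstract's pipeline (pixel intensities mapped to Unicode characters, then tokenized with BPE) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the codepoint offset, the flattening order, and the function names are all assumptions.

```python
# Sketch of the pixel-to-text step described in the abstract (assumed
# details): each grayscale intensity (0-255) is offset into a Unicode
# range, so an image becomes a string a BPE tokenizer could consume.

OFFSET = 0x100  # assumed codepoint offset; skips ASCII control characters

def pixels_to_text(pixels):
    """Flatten a 2-D grid of intensities into one Unicode string."""
    return "".join(chr(OFFSET + p) for row in pixels for p in row)

def text_to_pixels(text, width):
    """Invert the mapping: recover rows of intensities from the string."""
    vals = [ord(c) - OFFSET for c in text]
    return [vals[i:i + width] for i in range(0, len(vals), width)]

# Toy 2x2 "image": the round trip is lossless by construction.
image = [[0, 128], [255, 64]]
encoded = pixels_to_text(image)
assert text_to_pixels(encoded, 2) == image
```

In practice a BPE tokenizer trained on such strings would merge frequent intensity runs into multi-pixel tokens, which is what gives the vocabulary its scalability.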
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision-language models, transformers, representation learning, tokenization, symbolic representations
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: Vision Language (Images as a language)
Submission Number: 6681