Vision Language Meets Transformers

ACL ARR 2026 January Submission 6681 Authors

05 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Vision Language, BERT, Transformer, BPE, ImageNet, Token
Abstract: Recent vision transformers model images as token sequences but rely on patch-based or continuous pixel embeddings. We extend the vision-language paradigm by representing images as textual sequences derived directly from pixels and modeling them with transformer-based language models. Pixel intensities are mapped to Unicode characters and tokenized using Byte Pair Encoding (BPE), after which a BERT-style transformer is trained on the resulting sequences. As a proof of concept, we pretrain on ImageNet-1K and fine-tune on MNIST and Fashion-MNIST. Results show that vision-language-based transformer models can effectively operate on pixel-derived text and benefit from scalable vocabularies, framing images as a discrete and extensible language.
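The abstract's pipeline (pixel intensities mapped to Unicode characters, then tokenized with BPE) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the codepoint offset, the flattening order, and the function names are all assumptions.

```python
# Sketch of the pixel-to-text step described in the abstract (assumed
# details): each grayscale intensity (0-255) is offset into a Unicode
# range, so an image becomes a string a BPE tokenizer could consume.

OFFSET = 0x100  # assumed codepoint offset; skips ASCII control characters

def pixels_to_text(pixels):
    """Flatten a 2-D grid of intensities into one Unicode string."""
    return "".join(chr(OFFSET + p) for row in pixels for p in row)

def text_to_pixels(text, width):
    """Invert the mapping: recover rows of intensities from the string."""
    vals = [ord(c) - OFFSET for c in text]
    return [vals[i:i + width] for i in range(0, len(vals), width)]

# Toy 2x2 "image": the round trip is lossless by construction.
image = [[0, 128], [255, 64]]
encoded = pixels_to_text(image)
assert text_to_pixels(encoded, 2) == image
```

In practice a BPE tokenizer trained on such strings would merge frequent intensity runs into multi-pixel tokens, which is what gives the vocabulary its scalability.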
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision-language models, transformers, representation learning, tokenization, symbolic representations
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: Vision Language (Images as a language)
Submission Number: 6681