MultiModal Code-Switching: Interleaving Visual Objects into Language for Explicit Object-Level Alignment

ACL ARR 2026 January Submission 8116 Authors

Submitted: 06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: multimodal large language models, modality alignment pretraining, object-level alignment
Abstract: Existing Multimodal Large Language Models (MLLMs) predominantly rely on image-text pairs for modality alignment pretraining, mapping global image representations to long textual descriptions. However, this image-level alignment suffers from referential ambiguity: the model must infer the correspondences between multiple visual objects and textual entities from a single global representation, which leads to data inefficiency and suboptimal semantic grounding. To address this, we propose MultiModal Code-Switching (MMCS), a novel pretraining paradigm that provides explicit object-level supervision. Inspired by linguistic code-switching, MMCS interleaves vision and language by replacing textual entity embeddings with the embeddings of their corresponding visual objects, enforcing local visual–textual grounding during pretraining. We further develop a scalable data synthesis pipeline that generates 773k samples with accurate object–entity correspondences. Experiments across model scales show that MMCS is highly data-efficient: with only 50k samples, it matches or surpasses models trained on 600k image–caption pairs, while consistently improving visual grounding and perception capabilities.
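The interleaving mechanism described in the abstract can be pictured with a minimal sketch, not the authors' code: entity token embeddings in the input sequence are swapped for projected visual object embeddings before the sequence enters the language model. The class name, projection layer, span format, and dimensions below are illustrative assumptions.

```python
# Minimal sketch of the code-switching idea: replace textual entity-span
# embeddings with projected visual object embeddings. All names, dimensions,
# and the span format are hypothetical, chosen only for illustration.
import torch
import torch.nn as nn


class MultimodalCodeSwitcher(nn.Module):
    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):
        super().__init__()
        # Hypothetical projector mapping object features into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, text_dim)

    def forward(
        self,
        text_embeds: torch.Tensor,                  # (seq_len, text_dim) token embeddings
        object_feats: torch.Tensor,                 # (num_objects, vision_dim) region features
        entity_spans: list[tuple[int, int, int]],   # (start, end, object_idx) per entity mention
    ) -> torch.Tensor:
        """Replace each textual entity span with its projected object embedding."""
        obj_embeds = self.projector(object_feats)   # (num_objects, text_dim)
        pieces, cursor = [], 0
        for start, end, obj_idx in sorted(entity_spans):
            pieces.append(text_embeds[cursor:start])         # keep ordinary text tokens
            pieces.append(obj_embeds[obj_idx].unsqueeze(0))  # swap in the visual object
            cursor = end
        pieces.append(text_embeds[cursor:])
        return torch.cat(pieces, dim=0)


if __name__ == "__main__":
    switcher = MultimodalCodeSwitcher(vision_dim=8, text_dim=16)
    text = torch.randn(10, 16)       # e.g. embeddings of "the <dog> chases the <ball>"
    objects = torch.randn(2, 8)      # region features for the two objects
    spans = [(1, 2, 0), (5, 6, 1)]   # token spans of the two entity mentions
    mixed = switcher(text, objects, spans)
    print(mixed.shape)               # length may shrink if a span covered multiple tokens
```

Under this reading, the pretraining objective stays a standard next-token prediction over the mixed sequence; the only change is that entity positions are supervised against visual object representations rather than inferred from a global image embedding.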
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal pretraining, multimodality
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8116