Keywords: multimodal embedding space, multilingual embedding space
Abstract: We introduce V-SONAR, a vision–language embedding space extended from the
language-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026),
which supports 1500 text languages and 177 speech languages. To construct
V-SONAR, we propose a post-hoc alignment pipeline that maps the representations
of an existing vision encoder into the SONAR space. We thoroughly evaluate
V-SONAR and show that its embeddings achieve competitive performance on
text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR
further surpasses state-of-the-art vision–language models on video captioning tasks,
including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0).
Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM;
LCM Team et al., 2024), which operates in the SONAR space and was trained on English
text only, can perform both single- and multi-visual-concept understanding in a zero-shot manner.
Finally, we introduce V-LCM, which extends the LCM with vision–language
instruction tuning. V-LCM encodes vision and language inputs into a unified
sequence of latent embeddings via V-SONAR and SONAR, and it is trained with
the same latent diffusion objective for next-embedding prediction as in LCM’s
text-only pre-training. Experiments on a large-scale multilingual, multimodal
instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches
state-of-the-art vision–language models on tasks covering image/video captioning
and question answering, while significantly outperforming them on 61 of the 62 tested
languages, spanning high- to low-resource languages.
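Illustrative note: the abstract does not detail the post-hoc alignment pipeline, but the general idea of mapping a frozen vision encoder's outputs into a fixed text embedding space can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the projector architecture, mean pooling, the cosine loss against paired-caption SONAR embeddings, and the stand-in tensors for encoder outputs are all assumptions made for illustration.

```python
# Minimal sketch (assumptions noted above): learn a projector that maps frozen
# vision-encoder features onto the fixed SONAR embedding of the paired caption.
import torch
import torch.nn as nn

class SonarProjector(nn.Module):
    """Maps pooled vision features into the (frozen) SONAR embedding space."""
    def __init__(self, vision_dim: int, sonar_dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, sonar_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        pooled = vision_feats.mean(dim=1)   # simple mean pooling (assumption)
        return self.net(pooled)             # (batch, sonar_dim)

def alignment_loss(pred: torch.Tensor, target_sonar: torch.Tensor) -> torch.Tensor:
    # Pull the projected vision embedding toward the paired caption's SONAR embedding.
    return 1.0 - nn.functional.cosine_similarity(pred, target_sonar, dim=-1).mean()

# Toy training step; random tensors stand in for real encoder outputs.
projector = SonarProjector(vision_dim=768)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

vision_feats = torch.randn(4, 196, 768)   # stand-in for frozen vision-encoder features
caption_sonar = torch.randn(4, 1024)      # stand-in for SONAR embeddings of captions

loss = alignment_loss(projector(vision_feats), caption_sonar)
loss.backward()
optimizer.step()
```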
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12990