Keywords: Multimodal
TL;DR: We align the SONAR concept embedding scheme used in Meta's Large Concept Model with Google's new SigLIP 2 vision encoder.
Abstract: We introduce SigLIP-SONAR Concept Alignment (SSCA), a novel framework that transforms visual representation learning by aligning SigLIP 2 visual embeddings with SONAR semantic concept embeddings rather than traditional text tokens. This approach fundamentally reimagines cross-modal alignment by targeting language-agnostic semantic concepts instead of linguistically constrained tokens. Our architecture implements a multi-stage refinement process with cross-modal attention mechanisms and gated information flow to preserve critical visual features while enabling semantic enrichment. Using a sigmoid-based contrastive loss with a learnable temperature parameter, SSCA achieves superior training stability while mitigating representation collapse. Experimental results on the COCO and XM3600 datasets demonstrate strong text-to-image retrieval performance (60.3% and 78.1% R@1, respectively) after minimal training on CC12M, with particularly strong cross-lingual generalization despite training exclusively on English descriptions. Our findings indicate that aligning images with semantic concepts rather than text tokens can provide a more robust foundation for visual understanding systems, potentially transforming how we approach vision-language alignment and multilingual visual reasoning.
Submission Number: 36
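The abstract mentions a sigmoid-based contrastive loss with a learnable temperature. The sketch below is a minimal PyTorch illustration of such a pairwise sigmoid loss (in the style popularized by SigLIP); the class name, initialization values, and the inclusion of a learnable bias term are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmoidContrastiveLoss(nn.Module):
    """Pairwise sigmoid contrastive loss with a learnable temperature.

    Assumed parameterization: a learnable log-scale (temperature) and bias,
    initialized as in the SigLIP paper; SSCA's exact settings may differ.
    """

    def __init__(self, init_log_scale: float = 2.302585, init_bias: float = -10.0):
        super().__init__()
        # Learnable temperature stored as a log-scale for numerical stability
        self.log_scale = nn.Parameter(torch.tensor(init_log_scale))
        self.bias = nn.Parameter(torch.tensor(init_bias))

    def forward(self, image_emb: torch.Tensor, concept_emb: torch.Tensor) -> torch.Tensor:
        # L2-normalize both modalities before computing similarities
        image_emb = F.normalize(image_emb, dim=-1)
        concept_emb = F.normalize(concept_emb, dim=-1)

        # Pairwise similarity logits scaled by the learnable temperature
        logits = image_emb @ concept_emb.t() * self.log_scale.exp() + self.bias

        # +1 on the diagonal (matched pairs), -1 elsewhere (mismatched pairs)
        n = logits.size(0)
        labels = 2.0 * torch.eye(n, device=logits.device) - 1.0

        # Log-sigmoid loss over every image-concept pair, averaged per image
        return -F.logsigmoid(labels * logits).sum() / n
```

Unlike a softmax contrastive loss, each image-concept pair contributes an independent binary term, which is one reason sigmoid-based objectives are often reported to be more stable with respect to batch size.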