Keywords: Multimodal
TL;DR: We align the SONAR concept embedding scheme used in Meta's Large Concept Model with Google's new SigLIP 2 vision encoder.
Abstract: We introduce SigLIP-SONAR Concept Alignment (SSCA), a novel framework that transforms visual representation learning by aligning SigLIP 2 visual embeddings with SONAR semantic concept embeddings rather than traditional text tokens. This approach fundamentally reimagines cross-modal alignment by targeting language-agnostic semantic concepts instead of linguistically constrained tokens. Our architecture implements a multi-stage refinement process with cross-modal attention mechanisms and gated information flow to preserve critical visual features while enabling semantic enrichment. Using a sigmoid-based contrastive loss with a learnable temperature parameter, SSCA achieves superior training stability while mitigating representation collapse. Experimental results on the COCO and XM3600 datasets demonstrate strong text-to-image retrieval performance (60.3% and 78.1% R@1, respectively) after minimal training on CC12M, with particularly strong cross-lingual generalization despite training exclusively on English descriptions. Our findings indicate that aligning images with semantic concepts rather than text tokens can provide a more robust foundation for visual understanding systems, potentially transforming how we approach vision-language alignment and multilingual visual reasoning.
Submission Number: 36
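The abstract mentions a sigmoid-based contrastive loss with a learnable temperature. The sketch below is a minimal PyTorch illustration of such a pairwise sigmoid loss (in the style popularized by SigLIP); the class name, initialization values, and the inclusion of a learnable bias term are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmoidContrastiveLoss(nn.Module):
    """Pairwise sigmoid contrastive loss with a learnable temperature.

    Assumed parameterization: a learnable log-scale (temperature) and bias,
    initialized as in the SigLIP paper; SSCA's exact settings may differ.
    """

    def __init__(self, init_log_scale: float = 2.302585, init_bias: float = -10.0):
        super().__init__()
        # Learnable temperature stored as a log-scale for numerical stability
        self.log_scale = nn.Parameter(torch.tensor(init_log_scale))
        self.bias = nn.Parameter(torch.tensor(init_bias))

    def forward(self, image_emb: torch.Tensor, concept_emb: torch.Tensor) -> torch.Tensor:
        # L2-normalize both modalities before computing similarities
        image_emb = F.normalize(image_emb, dim=-1)
        concept_emb = F.normalize(concept_emb, dim=-1)

        # Pairwise similarity logits scaled by the learnable temperature
        logits = image_emb @ concept_emb.t() * self.log_scale.exp() + self.bias

        # +1 on the diagonal (matched pairs), -1 elsewhere (mismatched pairs)
        n = logits.size(0)
        labels = 2.0 * torch.eye(n, device=logits.device) - 1.0

        # Log-sigmoid loss over every image-concept pair, averaged per image
        return -F.logsigmoid(labels * logits).sum() / n
```

Unlike a softmax contrastive loss, each image-concept pair contributes an independent binary term, which is one reason sigmoid-based objectives are often reported to be more stable with respect to batch size.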