Enhancing Contextual Understanding with Multimodal Siamese Networks Using Contrastive Loss and Text Embeddings

Published: 2025 · Last Modified: 17 Nov 2025 · ICAIIC 2025 · CC BY-SA 4.0
Abstract: Deep learning has achieved significant advances in image representation learning, yet it remains constrained by challenges such as imbalanced datasets and limited contextual understanding of paired data. To address these issues, we propose a novel multimodal approach that integrates Contrastive Siamese Neural Networks with text descriptions generated by vision-language models (VLMs), specifically Pixtral. Our method aims to enhance contextual alignment between paired images by combining image embeddings with text embeddings of these descriptions derived from language models such as BERT or RoBERTa. Inspired by the architecture of CLIP, which synchronizes image and text encoders, our approach adapts contrastive learning to focus specifically on image embeddings while leveraging text embeddings to enrich the context. This multimodal framework is evaluated on both imbalanced and balanced datasets to assess its robustness and effectiveness. Key contributions include analyzing the role of generated text in providing context to images and demonstrating the potential of Siamese networks in multimodal settings. The experimental results highlight the advantages of our approach in improving contextual understanding and overall performance in both balanced and imbalanced dataset settings.
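The abstract does not specify implementation details, so the following is a minimal PyTorch sketch of how such a framework might be assembled. The embedding dimensions, the fusion-by-concatenation step, and the projection head are all assumptions, and the pairwise contrastive loss follows the standard Hadsell et al. (2006) formulation rather than the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalSiameseNet(nn.Module):
    """Hypothetical twin-branch network: each branch fuses an image
    embedding with a text embedding (e.g., from BERT/RoBERTa over a
    Pixtral-generated description) before projecting into a shared space."""
    def __init__(self, img_dim=2048, txt_dim=768, proj_dim=256):
        super().__init__()
        # Both branches share this projection head (Siamese weight sharing).
        self.proj = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Linear(512, proj_dim),
        )

    def embed(self, img_emb, txt_emb):
        # Concatenate the two modalities, project, and L2-normalize.
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)

    def forward(self, img_a, txt_a, img_b, txt_b):
        return self.embed(img_a, txt_a), self.embed(img_b, txt_b)

def contrastive_loss(z_a, z_b, label, margin=1.0):
    """Pairwise contrastive loss: label = 1 pulls similar pairs together,
    label = 0 pushes dissimilar pairs apart up to the margin."""
    d = F.pairwise_distance(z_a, z_b)
    return (label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)).mean()

# Usage with random placeholder embeddings; in practice these would come
# from a vision backbone and a BERT/RoBERTa encoder, respectively.
net = MultimodalSiameseNet()
img_a, img_b = torch.randn(8, 2048), torch.randn(8, 2048)
txt_a, txt_b = torch.randn(8, 768), torch.randn(8, 768)
labels = torch.randint(0, 2, (8,)).float()
z_a, z_b = net(img_a, txt_a, img_b, txt_b)
loss = contrastive_loss(z_a, z_b, labels)
loss.backward()
```

In this reading of the abstract, the contrastive objective operates on the fused image-plus-text embeddings of each branch, so the text acts purely as added context for the image comparison rather than as a separately aligned encoder output as in CLIP.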