Abstract: This paper presents a new approach to dense retrieval across multiple modalities, emphasizing the integration of images and sensor data. Traditional cross-modal retrieval techniques face significant challenges, particularly in processing non-linguistic modalities and in constructing effective training datasets. To address these issues, we propose a method that embeds all modalities into a shared vector space, optimized with a contrastive loss, to enable efficient and accurate retrieval across diverse modalities. A key innovation of our approach is a temporal closeness metric, which quantifies how closely two data points are related based on their timestamps. This metric enables the automatic extraction of positive and hard negative samples for a given query, improving the training process and enhancing the retrieval model’s performance. We validate our approach on the Lifelog Search Challenge 2024 (LSC’24) dataset, one of the largest multi-modal datasets, which includes non-linguistic data such as egocentric images, heart rate, and location information. Our evaluation shows that incorporating temporal closeness into the dense retrieval process significantly improves retrieval accuracy and robustness in real-world, multi-modal scenarios. This paper’s contributions include a novel dense retrieval framework, the temporal closeness metric, and the successful application of both to a comprehensive multi-modal dataset.
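The abstract does not include an implementation, so the following is a minimal sketch of how timestamp-based sample mining of this kind could look. It assumes an exponential-decay closeness function and fixed thresholds; the function names (`temporal_closeness`, `mine_samples`), the decay form, and the threshold values are illustrative assumptions, not the paper's actual method.

```python
import math


def temporal_closeness(t1: float, t2: float, tau: float = 300.0) -> float:
    """Closeness in (0, 1] that decays with the gap between two Unix
    timestamps; tau (seconds) controls how quickly closeness falls off.
    (Assumed form -- the paper's exact metric may differ.)"""
    return math.exp(-abs(t1 - t2) / tau)


def mine_samples(query_ts, candidates, pos_thresh=0.8, hard_neg_band=(0.3, 0.6)):
    """Split candidates into positives (very close in time to the query)
    and hard negatives (moderately close, so they resemble the query
    context but should not be retrieved). Each candidate is
    (item_id, timestamp); thresholds here are illustrative."""
    positives, hard_negatives = [], []
    for item_id, ts in candidates:
        closeness = temporal_closeness(query_ts, ts)
        if closeness >= pos_thresh:
            positives.append(item_id)
        elif hard_neg_band[0] <= closeness <= hard_neg_band[1]:
            hard_negatives.append(item_id)
    return positives, hard_negatives


if __name__ == "__main__":
    query_ts = 1_700_000_000.0
    candidates = [
        ("img_001", query_ts + 30),     # very close in time -> positive
        ("img_002", query_ts + 200),    # moderately close -> hard negative
        ("img_003", query_ts + 5_000),  # far in time -> ignored (easy negative)
    ]
    pos, hard_neg = mine_samples(query_ts, candidates)
    print(pos, hard_neg)  # ['img_001'] ['img_002']
```

The mined positives and hard negatives would then feed a contrastive objective (e.g. an InfoNCE-style loss) over the shared embedding space described in the abstract.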