C2MR: Continual Cross-Modal Retrieval for Streaming Multi-modal DataDownload PDFOpen Website

Published: 01 Jan 2023, Last Modified: 05 Nov 2023ACM Multimedia 2023Readers: Everyone
Abstract: Massive numbers of new images are uploaded to the internet every day. However, existing cross-modal retrieval (CMR) approaches struggle to accommodate this continuously growing data. The prevalent practice involves periodically retraining or fine-tuning a new model based on the accumulated data, which in turn invalidates billions of indexed features extracted by the previous model and incurs another substantial computational cost to extract new features for the entire data archive. Is it possible to develop a retrieval model that effectively captures the knowledge of upcoming sessions while preserving the discriminative power of features extracted in previous sessions? In this paper, we propose an online continual learning setup, OC-CMR, to formalize the data-incremental growth challenge faced by cross-modal retrieval systems. It consists of two key settings: 1) Similar to the real-world scenarios, the streaming multi-modal data arrives once per session; 2) Consider the computational costs, each instance of archived data has its feature extracted only once and by its corresponding model in its session. Based on our OC-CMR, we perform in-depth evaluations of state-of-the-art cross-modal retrieval methods and observe that they suffer from representational shift and collapse due to the catastrophic forgetting. To address this issue, we propose the Continual Cross-Modal Retrieval (C2MR) approach, which learns a shared common space not only across modalities but also sessions and maintains relationships between samples from distinct sessions via cross-modal relational coherence and semantic representation coordination. We construct two new benchmarks by adapting MS-COCO and Flickr30K datasets to the OC-CMR setting, providing a more challenging evaluation framework for CMR tasks. Experimental results demonstrate that our method effectively alleviates forgetting and significantly outperforms combinations of previous arts in cross-modal retrieval and continual learning.
0 Replies

Loading