VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning

Nanyi Fei, Hao Jiang, Haoyu Lu, Jinqiang Long, Yanqi Dai, Tuo Fan, Zhao Cao, Zhiwu Lu

Published: 2024, Last Modified: 08 Jan 2026, ECIR (1) 2024, CC BY-SA 4.0

Abstract: Cross-modal search is a fundamental task in multi-modal learning, yet hardly any work aims to solve multiple cross-modal search tasks at once. In this work, we propose a novel Versatile Elastic Multi-mOdal (VEMO) model for search-oriented multi-task learning. VEMO is versatile because it integrates cross-modal semantic search, named entity recognition, and scene text spotting into a unified framework, where the latter two can be further adapted to entity- and character-based image search tasks. VEMO is also elastic because the sub-modules of its flexible network architecture can be freely assembled for the corresponding tasks. Moreover, to offer more choices on the effectiveness-efficiency trade-off when performing cross-modal semantic search, we place multiple exits in the encoder. Experimental results show the effectiveness of VEMO, which uses only 37.6% of the network parameters needed for uni-task training. Further evaluations on entity- and character-based image search tasks also validate the superiority of search-oriented multi-task learning.
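
The idea of multiple encoder exits can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say where the exits sit or how the encoder is built, so the class name MultiExitEncoder, the random tanh layers standing in for encoder blocks, and the exit positions are all hypothetical. The sketch only shows how one encoder can return an embedding from a shallow exit (cheaper) or a deep exit (more computation).

```python
import math
import random


class MultiExitEncoder:
    """Toy stack of layers with an 'exit' after selected layers (hypothetical)."""

    def __init__(self, dim, num_layers, exit_layers, seed=0):
        rng = random.Random(seed)
        self.exit_layers = sorted(exit_layers)
        scale = 1.0 / math.sqrt(dim)
        # Each layer is a random dim x dim matrix, a stand-in for a real encoder block.
        self.layers = [
            [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_layers)
        ]

    @staticmethod
    def _apply(weights, x):
        # One layer: matrix-vector product followed by tanh.
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    def encode(self, x, exit_layer):
        """Run only the first `exit_layer` layers and return a unit-length embedding."""
        if exit_layer not in self.exit_layers:
            raise ValueError(f"no exit after layer {exit_layer}")
        h = list(x)
        for weights in self.layers[:exit_layer]:
            h = self._apply(weights, h)
        norm = math.sqrt(sum(v * v for v in h)) or 1.0
        return [v / norm for v in h]


def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Assumed configuration: 6 layers with exits after layers 2, 4, and 6.
encoder = MultiExitEncoder(dim=8, num_layers=6, exit_layers=[2, 4, 6])
query = [0.5, -1.0, 0.25, 0.0, 1.5, -0.75, 0.1, 0.9]
fast = encoder.encode(query, exit_layer=2)   # runs 2 layers
full = encoder.encode(query, exit_layer=6)   # runs all 6 layers
```

In a retrieval setting, query and candidate embeddings taken from the same exit would be compared with cosine; choosing a shallower exit trades some accuracy for lower latency.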