Highlights
• We design a cross-modal retrieval model based on multi-modal information fusion.
• Our model localizes target audio events from given textual queries.
• We develop an interactive graph to capture vital cross-modal semantic correlations.
• We conduct extensive comparison experiments on the benchmark dataset.
• Experimental results demonstrate that our model outperforms the baselines.