Listen as you wish: Fusion of audio and text for cross-modal event detection in smart cities

Published: 01 Jan 2024, Last Modified: 25 Mar 2025 · Information Fusion 2024 · CC BY-SA 4.0
Abstract: Highlights
• We design a cross-modal retrieval model based on multi-modal information fusion.
• Our model localizes target audio events from given textual queries.
• We develop an interactive graph to capture vital cross-modal semantic correlations.
• We conduct extensive comparison experiments on the benchmark dataset.
• Experimental results demonstrate that our model outperforms the baselines.
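The highlights describe localizing audio events from textual queries by scoring cross-modal similarity. As a rough illustration only (the function names, the shared-embedding-space assumption, and the cosine-similarity scoring are mine, not details from the paper), such retrieval can be sketched as:

```python
import numpy as np

def localize_events(text_query_emb, audio_segment_embs):
    """Rank audio segments by cosine similarity to a text query.

    Assumes both modalities are already projected into a shared
    embedding space; this is a generic retrieval sketch, not the
    paper's interactive-graph fusion model.
    """
    q = text_query_emb / np.linalg.norm(text_query_emb)
    A = audio_segment_embs / np.linalg.norm(
        audio_segment_embs, axis=1, keepdims=True
    )
    scores = A @ q                      # cosine similarity per segment
    ranking = np.argsort(-scores)       # best-matching segments first
    return ranking, scores

# Toy example: three audio-segment embeddings in a 4-d shared space.
segments = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
ranking, scores = localize_events(query, segments)
```

In practice the paper's contribution lies in how the two modalities are fused before scoring; this sketch only shows the final ranking step common to cross-modal retrieval.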