Highlights
• We design a cross-modal retrieval model based on multi-modal information fusion.
• Our model localizes target audio events from given textual queries.
• We develop an interactive graph to capture vital cross-modal semantic correlations.
• We conduct extensive comparison experiments on the benchmark dataset.
• Experimental results demonstrate that our model outperforms the baselines.