Abstract: Audio-visual matching techniques aim to recognize and match information from different identities by learning a similarity metric across modalities. However, modal differences, which arise from insufficient cross-modal correlations and noise interference, substantially hinder the performance of traditional deep metric learning methods on audio-visual matching tasks. To address these modal differences, we propose a novel Adaptive Interactive and Correction Attention Network (AICANet). The network efficiently captures deep information connections and generates modality-consistent feature embeddings within a unified metric framework. The core of AICANet is a two-pronged approach to reducing modal differences. First, we propose the Adaptive Interactive Attention (AIA) module, which flexibly establishes associations among cross-modal local features using dynamically generated pseudo-labels. Second, we propose the Adaptive Correction Attention (ACA) mechanism, which employs an adaptive threshold to suppress interference effectively and to adjust the representation of local feature associations accurately. Notably, the ACA mechanism is suitable for both intra-modal and inter-modal refined attention correction. Additionally, we design a relative distance stretching metric loss ($\mathcal{L}_{RDSM}$), which reinforces the similarity invariance of feature embeddings in a uniform space and enhances matching accuracy. Extensive experiments on the VoxCeleb and VoxCeleb2 datasets demonstrate that AICANet outperforms leading existing algorithms across several evaluation metrics. The code is available at https://github.com/w1018979952/AICANet.