Abstract: Speaker tracking plays a crucial role in various human-robot interaction applications. Recently, leveraging multimodal information, such as audio and visual signals, has become an important strategy for enhancing the robustness of tracking systems. However, current methods struggle to effectively exploit the complementarity between the audio and visual modalities. To this end, we propose an Audio-Visual Tracker based on Multi-Stage Multimodal Distillation (MSMD-AVT), which uses an audio-visual knowledge distillation framework to progressively fuse audio-visual information across multiple stages. MSMD-AVT is built on an audio-visual teacher-student model with three distinct distillation losses. During the feature extraction stage, feature alignment distillation ensures that the student network's feature representations remain consistent with the teacher's encoded features. During the feature fusion stage, fusion guidance distillation uses deep teacher features to guide the multimodal fusion process in the student network, strengthening the complementary benefits of audio-visual fusion. Finally, logits distillation is applied during the position estimation stage to help the student model better capture localization features through knowledge transfer and output alignment. Additionally, we present a multimodal fusion module in the student network based on a bidirectional cross-attention mechanism, which extracts complementary audio-visual contextual information and dynamically adjusts the contribution of each modality's features to the tracking task. Extensive experimental results on the widely used AV16.3 dataset indicate that MSMD-AVT significantly outperforms existing state-of-the-art methods in terms of accuracy and robustness. Our code is publicly available at https://github.com/moyitech/MSMD-AVT.
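To make the three-stage distillation objective in the abstract concrete, the sketch below shows one plausible way to combine a feature alignment loss, a fusion guidance loss, and a softened logits loss in PyTorch. The module name, tensor interfaces, loss choices (MSE and temperature-scaled KL), and the weighting hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStageDistillationLoss(nn.Module):
    """Hypothetical combination of the three distillation terms described in the abstract."""

    def __init__(self, alpha=1.0, beta=1.0, gamma=1.0, temperature=4.0):
        super().__init__()
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.t = temperature

    def forward(self,
                s_feat, t_feat,        # encoder features (feature extraction stage)
                s_fused, t_deep,       # student fused features vs. deep teacher features
                s_logits, t_logits):   # position-estimation outputs
        # 1) Feature alignment distillation: keep the student encoder features
        #    consistent with the teacher's encoded features.
        l_align = F.mse_loss(s_feat, t_feat)

        # 2) Fusion guidance distillation: let deep teacher features guide the
        #    student's multimodal fusion output.
        l_fusion = F.mse_loss(s_fused, t_deep)

        # 3) Logits distillation: match temperature-softened output distributions
        #    so the student better captures localization cues.
        l_logits = F.kl_div(
            F.log_softmax(s_logits / self.t, dim=-1),
            F.softmax(t_logits / self.t, dim=-1),
            reduction="batchmean",
        ) * (self.t ** 2)

        return self.alpha * l_align + self.beta * l_fusion + self.gamma * l_logits
```

Similarly, the bidirectional cross-attention fusion module could take a form like the following, assuming audio and visual features are already projected to a shared dimension; the gating scheme used to weight the two enriched streams is likewise an assumption for illustration.

```python
class BiCrossAttentionFusion(nn.Module):
    """Minimal sketch of bidirectional cross-attention fusion (illustrative, not the paper's code)."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio, visual):
        # audio, visual: (batch, seq_len, dim)
        # Audio queries attend to visual context, and vice versa.
        a_ctx, _ = self.a2v(query=audio, key=visual, value=visual)
        v_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        # Dynamically weight the two cross-attended streams before fusing them.
        g = self.gate(torch.cat([a_ctx, v_ctx], dim=-1))
        return g * a_ctx + (1.0 - g) * v_ctx
```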
External IDs: dblp:conf/icassp/LiZWXRS25