Cross-Modal Transformers for Audio-Visual Person Verification

Published: 01 Jan 2024, Last Modified: 04 Nov 2024 · Odyssey 2024 · CC BY-SA 4.0
Abstract: Although person verification has predominantly been explored using voice and faces independently, audio-visual fusion has recently gained considerable attention, as the two modalities often compensate for each other's weaknesses. However, most existing works on audio-visual fusion for person verification rely on early feature concatenation or score-level fusion. Recently, transformers have proven promising for a wide range of applications. In this work, we explore the potential of Cross-Modal Transformers (CMT) for effective fusion of audio and visual modalities for person verification. In particular, we explore cross-attention using transformers, where the embeddings of one modality attend to those of the other to capture their complementary relationships. Extensive experiments on the VoxCeleb1 dataset show that the proposed approach effectively captures the complementary relationships across audio and visual modalities while outperforming state-of-the-art approaches.
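The cross-attention idea described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' exact architecture: the module name, dimensions, and pooling choice are assumptions; the core mechanism is that each modality's embeddings serve as queries attending over the other modality's embeddings.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-modal attention fusion (hypothetical layer names/sizes).

    Audio embeddings attend to visual embeddings and vice versa; the two
    attended representations are pooled and concatenated into a joint
    person embedding for verification.
    """
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, audio_len, dim) frame-level speech embeddings
        # visual: (batch, visual_len, dim) frame-level face embeddings
        a_att, _ = self.a2v(query=audio, key=visual, value=visual)
        v_att, _ = self.v2a(query=visual, key=audio, value=audio)
        # Mean-pool over time and concatenate into one joint embedding
        return torch.cat([a_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)

# Toy usage: 2 utterance/face-track pairs with different sequence lengths
audio = torch.randn(2, 10, 256)
visual = torch.randn(2, 5, 256)
emb = CrossModalFusion()(audio, visual)
print(emb.shape)  # (2, 512): 256-dim attended audio + 256-dim attended visual
```

In a verification setup, such joint embeddings for two recordings would typically be compared with a cosine-similarity score against a decision threshold.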
