LAVViT: Latent Audio-Visual Vision Transformers for Speaker Verification

Published: 01 Jan 2025, Last Modified: 31 Jul 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Recently, Vision Transformers (ViTs) have shown remarkable success in various computer vision applications. In this work, we explore the potential of ViTs, pre-trained on visual data, for audio-visual speaker verification. To cope with the challenges of large-scale training, we introduce Latent Audio-Visual Vision Transformer (LAVViT) adapters, which exploit existing models pre-trained on visual data without fine-tuning their parameters: only the parameters of the LAVViT adapters are trained. The LAVViT adapters are injected into every layer of the ViT architecture to effectively fuse the audio and visual modalities through a small set of latent tokens, forming an attention bottleneck and thereby avoiding the quadratic computational cost of full cross-attention across the modalities. The proposed approach is evaluated on the VoxCeleb1 dataset and shows promising performance using only a small number of trainable parameters. Code is available at https://github.com/praveena2j/LAVViT
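The abstract describes fusion through a small set of latent tokens acting as an attention bottleneck between the frozen ViT streams. Below is a minimal, hypothetical PyTorch sketch of such a latent-bottleneck adapter; the class name, shapes, and hyperparameters are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class LatentBottleneckAdapter(nn.Module):
    """Sketch of an adapter that fuses audio and visual tokens via latent tokens.

    The latents first attend to the concatenated audio-visual tokens (cost linear
    in the number of modality tokens), then each modality attends back to the
    latents, avoiding full quadratic cross-attention between the two modalities.
    """

    def __init__(self, dim: int = 768, num_latents: int = 8, num_heads: int = 8):
        super().__init__()
        # Small set of learnable latent tokens shared across the batch.
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor):
        # audio_tokens: (B, Na, dim); visual_tokens: (B, Nv, dim)
        b = audio_tokens.size(0)
        latents = self.latents.expand(b, -1, -1)

        # 1) Latents gather information from both modalities (the bottleneck).
        fused = torch.cat([audio_tokens, visual_tokens], dim=1)
        latents, _ = self.collect(latents, fused, fused)

        # 2) Each modality reads the fused information back from the latents.
        audio_out, _ = self.distribute(self.norm(audio_tokens), latents, latents)
        visual_out, _ = self.distribute(self.norm(visual_tokens), latents, latents)

        # Residual connections keep the frozen ViT features intact.
        return audio_tokens + audio_out, visual_tokens + visual_out
```

Under this reading, an adapter of this kind would be inserted after each frozen ViT layer, and only the adapter parameters (latents, attention projections, normalization) would receive gradients during training.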