Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

Published: 10 Mar 2025 · Last Modified: 14 Mar 2025 · OpenReview Archive Direct Upload · Everyone · CC BY 4.0
Abstract: Audio-Visual Speech Recognition (AVSR) leverages both audio and visual modalities to enhance speech recognition robustness, particularly in noisy environments. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including AVSR. However, due to the significant length of speech representations, direct integration with LLMs imposes substantial computational costs. Prior approaches address this by compressing speech representations before feeding them into LLMs. However, higher compression ratios often lead to performance degradation, necessitating a trade-off between computational efficiency and recognition accuracy. To address this challenge, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation based on specific computational constraints while preserving high performance. Our approach, inspired by Matryoshka Representation Learning, encodes audio-visual representations at multiple granularities within a single model, eliminating the need to train separate models for different compression levels. Moreover, to efficiently fine-tune the LLM, we introduce three LoRA-based Matryoshka strategies using global and scale-specific LoRA modules. Extensive evaluations on the two largest AVSR datasets demonstrate that Llama-MTSK achieves state-of-the-art results, matching or surpassing models trained independently at fixed compression levels.
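To make the Matryoshka idea concrete, the sketch below illustrates one plausible way to produce audio-visual tokens at several compression granularities from a single module, using average pooling as the compression operator and a shared projector into the LLM embedding space. The names (`MatryoshkaCompressor`, `RATES`) and the choice of pooling are assumptions for illustration, not the paper's actual implementation; the abstract does not specify the compression operator or the exact scale set.

```python
# Hypothetical sketch of Matryoshka-style multi-scale audio-visual token
# compression. Assumptions: average pooling compresses fused audio-visual
# features, and a single shared projector maps every scale into the LLM
# embedding space. During training, a loss would be computed at every scale
# so one model covers all compression levels; at inference, a single rate is
# chosen to fit the compute budget.
import torch
import torch.nn as nn
import torch.nn.functional as F

RATES = (1, 2, 4, 8)  # assumed compression ratios (input frames per output token)


class MatryoshkaCompressor(nn.Module):
    """Compress fused audio-visual features at several granularities at once."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        # Shared projector: every granularity lands in the same LLM token space.
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> dict:
        # feats: (batch, seq_len, feat_dim) fused audio-visual features
        outputs = {}
        for rate in RATES:
            # Average-pool non-overlapping windows of `rate` frames.
            pooled = F.avg_pool1d(
                feats.transpose(1, 2), kernel_size=rate, stride=rate
            ).transpose(1, 2)
            outputs[rate] = self.proj(pooled)  # (batch, seq_len // rate, llm_dim)
        return outputs


if __name__ == "__main__":
    compressor = MatryoshkaCompressor(feat_dim=512, llm_dim=4096)
    av_feats = torch.randn(2, 160, 512)  # e.g. 2 utterances, 160 fused frames
    for rate, tokens in compressor(av_feats).items():
        print(f"rate {rate}: tokens of shape {tuple(tokens.shape)}")
```

In the same spirit, the global versus scale-specific LoRA strategies mentioned in the abstract could be realized by sharing one LoRA adapter across all scales or by routing each granularity through its own adapter; the abstract leaves the details of the three strategies to the paper itself.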