MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a scalable and robust AVSR framework that leverages MoE with hierarchical gating to adaptively utilize audio and visual expert groups.
Abstract: Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
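To make the hierarchical gating idea concrete, below is a minimal sketch of the general pattern the abstract describes: a top-level gate weighs modality-specific expert groups (audio vs. visual), while each group routes tokens to its own sparse top-k experts. All module names, dimensions, and routing details here are illustrative assumptions, not the MoHAVE implementation.

```python
# Hierarchical MoE gating sketch: a group-level gate over audio/visual expert
# groups, each with standard sparse top-k routing. Names and hyperparameters
# are hypothetical, chosen only to illustrate the mechanism.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertGroup(nn.Module):
    """A group of feed-forward experts with a top-k router (standard sparse MoE)."""
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Each token is routed to its top-k experts,
        # and expert outputs are combined with softmax-normalized weights.
        logits = self.router(x)                          # (B, S, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

class HierarchicalMoELayer(nn.Module):
    """Top-level gate mixes modality expert groups; each group routes internally."""
    def __init__(self, dim: int):
        super().__init__()
        self.groups = nn.ModuleDict({
            "audio": ExpertGroup(dim),
            "visual": ExpertGroup(dim),
        })
        self.group_gate = nn.Linear(dim, len(self.groups))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (B, S, dim) audio-visual features. The group gate adapts the
        # modality mix per token, e.g. leaning on visual experts under acoustic noise.
        gate = F.softmax(self.group_gate(fused), dim=-1)  # (B, S, num_groups)
        outs = [g(fused) for g in self.groups.values()]
        return sum(gate[..., i].unsqueeze(-1) * o for i, o in enumerate(outs))

# Usage:
# layer = HierarchicalMoELayer(dim=512)
# y = layer(torch.randn(2, 50, 512))   # (batch=2, seq=50, dim=512)
```

Because only the top-k experts in each group run per token, capacity grows with the number of experts while per-token compute stays roughly constant, which is the scalability property the abstract claims.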
Lay Summary: Understanding speech in noisy environments is difficult for both humans and machines. While combining audio and visual signals (like lip movements) can help, current audio-visual speech recognition (AVSR) systems struggle to scale without becoming inefficient. To solve this, we develop MoHAVE, a new AI model that smartly divides the work between specialized audio and visual components. These components, called “experts”, are activated based on the input, so the model only uses what it needs—saving time and computing power. Our approach includes a clever decision system that learns how to balance these expert groups dynamically, depending on the level and type of noise. As a result, MoHAVE is both more accurate and more efficient than existing AVSR models.
Primary Area: Applications->Language, Speech and Dialog
Keywords: audio-visual speech recognition, mixture-of-experts, multimodal
Submission Number: 6447