EQUALS: An Audio-Visual LLM with One-Stage Question-Guided Alignment and Flexible Fusion

18 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: audio-visual question answering, multimodal alignment, multimodal fusion, visual compression, one-stage
TL;DR: An Audio-Visual LLM with One-Stage Question-Guided Alignment and Flexible Fusion
Abstract: Audio-Visual Question Answering (AVQA) has emerged as a crucial task for multimodal reasoning in human-computer interaction, requiring models to align and interpret visual and auditory signals conditioned on natural language questions. Despite recent progress, three key challenges remain: (1) difficulty in locating question-relevant segments within lengthy and redundant video streams, (2) suboptimal audio-visual alignment due to the decoupling between pretraining and task-specific supervision, and (3) insufficient flexibility in fusion strategies across diverse tasks. To address these issues, we propose EQUALS (onE-stage Question gUided Alignment and fLexible fuSion), a unified end-to-end AVQA framework that integrates compression, alignment, and fusion within a single stage. Specifically, we interleave optimal transport-based loss modules before and after the question-guided pooling module to achieve fine-grained semantic alignment. To enhance adaptability in fusion, we introduce FlexFuseMoE, a mixture-of-experts module that supports early, mid, and late fusion via flexible expert routing. Experiments on MUSIC-AVQA and its challenging variant FortisAVQA demonstrate that EQUALS achieves new state-of-the-art results while remaining interpretable. Our findings highlight the importance of jointly modeling alignment and fusion under explicit question guidance, offering a flexible and scalable solution for audio-visual understanding.
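Illustration: the abstract names two mechanisms, optimal transport-based alignment losses around a question-guided pooling step, and FlexFuseMoE, a mixture-of-experts fusion module. The paper's implementation is not reproduced here; the PyTorch sketch below is only one plausible reading of those descriptions, and every name, shape, and design choice in it (the Sinkhorn-style loss, the question-conditioned soft router, the three generic fusion experts) is an assumption rather than the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def sinkhorn_alignment_loss(x, y, eps=0.05, iters=20):
    # Generic entropic-OT alignment between audio tokens x (B, N, D) and
    # visual tokens y (B, M, D); a standard Sinkhorn sketch, not the paper's exact loss.
    cost = 1.0 - F.cosine_similarity(x.unsqueeze(2), y.unsqueeze(1), dim=-1)  # (B, N, M)
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    a = x.new_full(x.shape[:2], 1.0 / x.shape[1])   # uniform source marginal
    b = y.new_full(y.shape[:2], 1.0 / y.shape[1])   # uniform target marginal
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                          # Sinkhorn iterations enforce marginals
        u = a / (K @ v.unsqueeze(-1)).squeeze(-1).clamp_min(1e-9)
        v = b / (K.transpose(1, 2) @ u.unsqueeze(-1)).squeeze(-1).clamp_min(1e-9)
    T = u.unsqueeze(-1) * K * v.unsqueeze(1)        # transport plan (B, N, M)
    return (T * cost).sum(dim=(1, 2)).mean()

class FusionExpert(nn.Module):
    # One fusion expert: a small MLP over concatenated audio-visual features (assumed design).
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, a, v):
        return self.net(torch.cat([a, v], dim=-1))

class FlexFuseMoE(nn.Module):
    # Hypothetical reading of FlexFuseMoE: a question-conditioned router soft-weights
    # experts standing in for early, mid, and late fusion strategies.
    def __init__(self, dim, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(FusionExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)   # routes on the question embedding
    def forward(self, audio, video, question):
        # audio, video, question: (B, dim) pooled features
        gate = F.softmax(self.router(question), dim=-1)                     # (B, E)
        outs = torch.stack([e(audio, video) for e in self.experts], dim=1)  # (B, E, dim)
        return (gate.unsqueeze(-1) * outs).sum(dim=1)                       # (B, dim)

In this reading, routing on the question embedding is what makes the fusion strategy question-dependent; the actual module may instead route per token or use hard top-k expert selection.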
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11465