Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models

13 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0
Keywords: Audio-visual LLMs, Multimodal LLMs, Multimodal LLM inference
TL;DR: Fork-Merge Decoding enhances balanced multimodal reasoning in audio-visual large language models by separating audio and visual understanding in the early decoder layers and merging them in the later layers—all without requiring additional training.
Abstract: The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs) by addressing modality bias without additional training. In current AV-LLMs, audio and video features are typically processed jointly in the decoder. While this strategy facilitates unified multimodal understanding, it may introduce modality bias, where the model tends to over-rely on one modality due to imbalanced training signals. To mitigate this, we propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications. FMD first performs modality-specific reasoning by processing audio-only and video-only inputs through the early decoder layers (fork), and then merges the resulting hidden states for joint reasoning in the remaining layers (merge). This separation allows each modality to be emphasized in the early stages while encouraging balanced contributions during integration. We validate our method on three representative AV-LLMs—VideoLLaMA2, video-SALMONN, and Qwen2.5-Omni—using three benchmark datasets. Experimental results show consistent gains in audio, video, and audio-visual reasoning tasks, highlighting the effectiveness of inference-time interventions for robust and efficient multimodal understanding.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4716
Loading