Abstract: Audio-Visual Question Answering (AVQA) requires
models to effectively utilize both visual and auditory modalities to
answer complex and diverse questions about audio-visual scenes.
However, existing methods lack flexibility and dynamic
adaptability in temporal sampling and modality preference
awareness, making it difficult to focus on question-relevant
information and limiting their reasoning capability in complex
scenarios. To address these challenges, we propose a novel
framework named AV-Master. By dynamically modeling both the
temporal and modality dimensions, it strengthens the model's ability
to extract key information from complex audio-visual scenes
that contain substantial redundant content. In the temporal dimension,
we introduce a dynamic adaptive focus sampling mechanism
that progressively focuses on audio-visual segments most relevant
to the question, effectively mitigating redundancy and segment
fragmentation in traditional sampling methods. In the modality
dimension, we propose a preference-aware strategy that models
each modality’s contribution independently, enabling selective
activation of critical features. Furthermore, we introduce a
dual-path contrastive loss to reinforce consistency and
complementarity across the temporal and modality dimensions,
guiding the model to learn question-specific cross-modal collaborative
representations. Experiments on four large-scale benchmarks
show that AV-Master significantly outperforms existing methods,
especially in complex reasoning tasks.