This codebase builds on the open-source VideoLLaMA2 model.

To implement our proposed Audio-Visual Contrastive Decoding (AVCD), we introduced key modifications to the following components:

videollama2/__init__.py
– Handles the model’s forward execution for next-token generation under the AVCD setting, incorporating entropy-guided adaptive decoding and modality-aware intervention.
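The decoding step described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the function names (`avcd_step`), the entropy threshold `tau`, and the contrastive weight `alpha` are assumptions chosen for clarity. The idea is that when the full audio-visual distribution is already confident (low entropy), standard decoding is kept; otherwise the logits are contrasted against an "amateur" pass with a modality masked out.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def avcd_step(logits_full, logits_masked, alpha=1.0, tau=1.0):
    """Hypothetical entropy-guided adaptive contrastive decoding step.

    logits_full:   next-token logits with all modalities visible.
    logits_masked: logits from an "amateur" pass with a modality masked.
    If the full distribution is confident (entropy < tau), return it
    unchanged; otherwise amplify it against the amateur logits to
    suppress tokens favored only by a spurious modality.
    """
    p_full = softmax(logits_full)
    if entropy(p_full) < tau:
        return logits_full  # confident: skip contrastive correction
    return (1.0 + alpha) * logits_full - alpha * logits_masked
```

In this sketch the adaptive part is the entropy gate: the contrastive correction (and its extra forward pass) is only paid for on uncertain steps.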

videollama2/model/qwen.py
– Implements the core logic for analyzing modality dominance and applying attentive masking strategies during inference.

Note: this release integrates AVCD through __init__.py; we plan to release an alternative version that embeds the AVCD logic directly into transformer/utils.py for improved efficiency.

