Abstract: We present PAVE, a framework for adapting pre-trained video large language models (Video-LLMs) to downstream tasks that incorporate side-channel signals such as audio, camera pose, or high-frame-rate video. PAVE introduces a lightweight adaptation strategy called "patching", which adds a small number of parameters and operations to the base model without modifying its architecture or pre-trained weights. We demonstrate that PAVE effectively enhances pre-trained Video-LLMs at the cost of fewer than 1% additional FLOPs and parameters across diverse tasks, including audio-visual understanding, 3D reasoning, and multi-view video understanding, surpassing state-of-the-art task-specific models. Moreover, when applied to high-frame-rate video, PAVE further improves video understanding, boosting the performance of even strong base models. Finally, our experiments show that PAVE generalizes well across different Video-LLMs.
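To make the "patching" idea concrete, the sketch below shows one way a small trainable module could fuse side-channel tokens into the frozen visual tokens of a base Video-LLM. This is a minimal illustration under our own assumptions (the module name `SideChannelPatch`, the dimensions, and the gated cross-attention fusion are hypothetical), not a description of PAVE's actual design; it only conveys that a few added parameters can inject side-channel information while leaving the pre-trained weights and architecture untouched.

```python
# Illustrative sketch only: names, dimensions, and the gated cross-attention
# fusion are assumptions for exposition, not PAVE's actual mechanism.
import torch
import torch.nn as nn


class SideChannelPatch(nn.Module):
    """A small trainable module that fuses side-channel tokens (e.g., audio or
    camera-pose features) into the frozen visual tokens of a base Video-LLM."""

    def __init__(self, dim: int = 1024, side_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.side_proj = nn.Linear(side_dim, dim)               # project side-channel features
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))                 # zero-initialized gate: identity at start

    def forward(self, visual_tokens: torch.Tensor, side_tokens: torch.Tensor) -> torch.Tensor:
        side = self.side_proj(side_tokens)
        fused, _ = self.attn(query=visual_tokens, key=side, value=side)
        # Gated residual: the base model's behavior is preserved at initialization,
        # and the patch learns how much side-channel information to mix in.
        return visual_tokens + torch.tanh(self.gate) * fused


if __name__ == "__main__":
    patch = SideChannelPatch()
    visual = torch.randn(2, 64, 1024)   # frozen Video-LLM visual tokens (batch, tokens, dim)
    audio = torch.randn(2, 32, 256)     # hypothetical side-channel (e.g., audio) tokens
    out = patch(visual, audio)          # same shape as the visual tokens
    print(out.shape)                    # torch.Size([2, 64, 1024])
```

In a setup like this, only the patch module would be trained while the base Video-LLM stays frozen, which is consistent with the abstract's claim of adding fewer than 1% additional parameters and FLOPs.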