Unifying Specialized Visual Encoders for Video Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose a VideoLLM approach which fuses multiple visual encoders effectively and combines their specialized knowledge into one model.
Abstract: Recent advances in vision backbones have yielded powerful and diverse visual and video encoders. Yet, current Video Large Language Models encode visual inputs using an encoder from a single backbone family, limiting the amount and type of visual information they can process. We propose MERV, a Multi-Encoder Video Representation, which utilizes multiple encoders for a comprehensive video representation. To optimize heterogeneous features from a broad spectrum of encoders and ensure efficient and coherent feature integration, MERV first aligns encoder features spatio-temporally, then projects them into a unified structure, and finally fuses them through cross-attention. Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding.
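The abstract describes a three-step fusion pipeline: spatio-temporal alignment of heterogeneous encoder features, projection into a unified structure, and fusion via cross-attention. The sketch below illustrates that pipeline in miniature with numpy; the resampling-based alignment, the random projection matrices, and the token/dimension sizes are all illustrative assumptions, not the paper's actual implementation (see the linked code for that).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_tokens(feats, n_tokens):
    # Align encoder outputs to a common token count via nearest-neighbor
    # resampling -- a stand-in for the paper's spatio-temporal alignment.
    idx = np.linspace(0, feats.shape[0] - 1, n_tokens).round().astype(int)
    return feats[idx]

def cross_attention(queries, tokens):
    # Single-head cross-attention: queries attend over the unified tokens.
    d = queries.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))
    return attn @ tokens

# Hypothetical outputs of three specialized encoders, each with its own
# token count and feature dimension (shapes are made up for illustration).
enc_feats = [rng.normal(size=(n, d)) for n, d in [(196, 768), (49, 1024), (256, 512)]]
N, D = 64, 256  # common token count and unified feature dimension

# Step 1-2: align each encoder spatio-temporally, then project into a
# shared dimension (random projection weights stand in for learned ones).
projections = [rng.normal(size=(f.shape[1], D)) * 0.02 for f in enc_feats]
aligned = [align_tokens(f, N) @ w for f, w in zip(enc_feats, projections)]

# Step 3: stack the unified tokens and fuse them with cross-attention.
tokens = np.concatenate(aligned, axis=0)   # (3*N, D) unified structure
queries = rng.normal(size=(N, D))          # learnable queries (random here)
fused = cross_attention(queries, tokens)   # (N, D) fused video representation
print(fused.shape)                         # (64, 256)
```

The fused `(N, D)` sequence is what a VideoLLM would then pass to its language model, replacing the single-encoder feature sequence.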
Lay Summary: When researchers design AI chatbots that understand videos, they often use a single pre-existing vision-understanding module. However, we found that such modules are often specialized, e.g. for understanding colors but not movements, or vice versa, and extending their knowledge in a simple manner proved difficult. We design a method of harnessing multiple specialized vision modules together, even if each was created with different designs and goals. Our AI chatbot, MERV, answers a greater breadth of video-related questions because each module can cover for the others' weaknesses. As one of the first papers to work on multi-module chatbots, we hope this demonstrates the potential of having multiple vision-understanding modules in one system.
Link To Code: https://github.com/princetonvisualai/merv
Primary Area: Applications->Computer Vision
Keywords: video understanding, multimodal
Submission Number: 6963