RM-LLaVA: An Efficient Training-Free Video Understanding Framework with Redundancy-Minimized Frame Selection

Published: 11 Nov 2025, Last Modified: 16 Jan 2026 | DAI Poster | CC BY 4.0
Keywords: training-free video understanding, vision-language model, frame selection, redundancy minimization, semantic diversity
TLDR: A training-free video understanding framework that selects diverse and non-redundant frames to enhance vision-language reasoning without fine-tuning.
Abstract: Vision-Language Models (VLMs) have made significant advances in image-language tasks, and many recent studies explore their application to video understanding. However, video frames produce a large number of visual tokens, making efficient frame selection crucial. Existing methods often rely on relatively simple frame sampling strategies, which result in redundant frame selection and insufficient temporal coverage for complex, dynamic video content. To address this issue, we propose RM-LLaVA, a training-free video-language understanding framework based on Redundancy-Minimized Frame Selection (RMFS). RMFS consists of two stages: (i) structure-aware clustering to refine the candidate frame pool, and (ii) iterative semantic diversity maximization to select a compact and informative set of frames. The method requires neither training nor fine-tuning and can be seamlessly applied to off-the-shelf image-language models. Extensive experiments on three standard VideoQA benchmarks show that RM-LLaVA outperforms existing training-free approaches on two of them and surpasses many fine-tuned models. Ablation studies further verify the effectiveness and complementarity of the proposed components. The code will be released upon acceptance.
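The two-stage RMFS procedure described in the abstract (structure-aware clustering to refine the candidate pool, followed by iterative semantic diversity maximization) can be illustrated roughly as follows. This is a minimal sketch assuming precomputed per-frame embeddings, k-means as the clustering step, and a greedy max-min cosine-distance criterion for diversity; the function name, parameters, and these concrete choices are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def rmfs_select(frame_embeddings: np.ndarray, num_clusters: int = 32, num_frames: int = 8) -> list[int]:
    """Sketch of redundancy-minimized frame selection over per-frame embeddings.

    Stage 1 clusters the frames and keeps one representative per cluster as the
    refined candidate pool; Stage 2 greedily picks candidates that maximize
    semantic diversity. Both concrete criteria here are assumptions for illustration.
    """
    n = len(frame_embeddings)

    # Stage 1: cluster frame embeddings; the frame closest to each centroid
    # becomes a candidate, pruning near-duplicate frames.
    km = KMeans(n_clusters=min(num_clusters, n), n_init=10).fit(frame_embeddings)
    candidates = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(frame_embeddings[members] - km.cluster_centers_[c], axis=1)
        candidates.append(int(members[np.argmin(dists)]))

    # Stage 2: iterative diversity maximization (greedy max-min cosine distance
    # to the already-selected set).
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    selected = [candidates[0]]
    remaining = set(candidates[1:])
    while remaining and len(selected) < num_frames:
        # Choose the candidate farthest from everything selected so far.
        best = max(remaining, key=lambda i: min(1.0 - float(emb[i] @ emb[j]) for j in selected))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)
```

The selected indices would then be passed, in temporal order, to an off-the-shelf image-language model in place of uniformly sampled frames.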
Submission Number: 52