Abstract: Recent advances in video large language models (VLLMs) have enabled strong zero-shot reasoning, yet most systems still rely on uniform frame sampling and fragile GPU execution pipelines. We propose MAS-LLaVA, a training-free enhancement to LLaVA that combines uniform frame sampling, adaptive importance-based token sampling, and device-consistent inference. First, uniform frame sampling selects equidistant frames, which gives balanced temporal coverage and keeps the method compatible with existing aggregation strategies. Second, an adaptive token selection module computes the feature magnitude and diversity of patch tokens, assigns each token a probabilistic importance score, and samples a compact subset of visual tokens under a fixed token budget. Third, a device-aware execution pipeline ensures that all intermediate tensors inherit the input frame's device and data type, so that both the uniform and the adaptive sampling strategies run reliably across heterogeneous GPUs. Experiments on NExT-QA and IntentQA show that MAS-LLaVA improves accuracy and stability across diverse video inputs without any retraining. These findings demonstrate that smarter, training-free sampling and inference design can substantially improve both the practicality and robustness of VLLMs in real-world video-language understanding.
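The first two components can be illustrated with a minimal sketch in plain Python. This is not the MAS-LLaVA implementation: the abstract does not give the scoring formula, so the score used here (L2 norm for magnitude plus distance to the mean token for diversity) is an assumption, and the function names `uniform_frame_indices`, `importance_scores`, and `sample_tokens` are hypothetical. Device and data-type handling is omitted because it depends on the tensor library in use.

```python
import math
import random


def uniform_frame_indices(num_frames, num_samples):
    """Return equidistant frame indices: the centre of each equal-length segment."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def importance_scores(tokens):
    """Assumed score (not from the paper): L2 norm (magnitude) plus distance
    to the mean token (diversity), normalised so the scores sum to 1."""
    if not tokens:
        return []
    n, dim = len(tokens), len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    raw = []
    for t in tokens:
        magnitude = math.sqrt(sum(x * x for x in t))
        diversity = math.sqrt(sum((x - m) ** 2 for x, m in zip(t, mean)))
        raw.append(magnitude + diversity)
    total = sum(raw)
    if total == 0:
        return [1.0 / n] * n  # all tokens identical and zero: fall back to uniform
    return [r / total for r in raw]


def sample_tokens(tokens, budget, seed=0):
    """Sample up to `budget` distinct token indices, each draw proportional to
    importance, and return them sorted so the original token order is kept."""
    probs = importance_scores(tokens)
    rng = random.Random(seed)
    remaining = list(range(len(tokens)))
    # A tiny epsilon keeps the weights valid even if only zero-score tokens remain.
    weights = [p + 1e-12 for p in probs]
    chosen = []
    for _ in range(min(budget, len(tokens))):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(chosen)


frames = uniform_frame_indices(num_frames=10, num_samples=5)  # [1, 3, 5, 7, 9]
patch_tokens = [[0.1, 0.2], [2.0, -1.0], [0.0, 0.0], [1.5, 1.5]]
kept = sample_tokens(patch_tokens, budget=2)  # two distinct indices into patch_tokens
```

Sampling without replacement keeps the selected subset within the token budget while still giving low-scoring tokens a nonzero chance of being kept, which is one plausible reading of "probabilistic importance scores".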