TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding

TempFlex: Advancing MLLMs with Temporal Perception and Natively Scalable Resolution Encoding

TMLR Paper5478 Authors

27 Jul 2025 (modified: 09 Nov 2025)Decision pending for TMLREveryoneRevisionsBibTeXCC BY 4.0

Abstract: Multimodal large language models (MLLMs) have made significant progress across vision-language tasks, yet many designs still suffer from two core limitations. (i) Excessive visual tokens and broken global context: Tiled Patch Encoding fragments high-resolution images, leading to token overload and disrupting global attention modeling. (ii) Lack of temporal reasoning: Most models process video as independent frames using static image encoders, failing to capture temporal dynamics. We present TempFlex-VL, a token-efficient and temporally aware MLLM that addresses both issues through lightweight architectural enhancements. First, we introduce a resolution-agnostic visual encoder that directly processes full images without tiling, preserving global context while substantially reducing visual tokens. Second, we propose Temporal Fiber Fusion (TFF), a plug-and-play module with three complementary pathways: (1) a dynamic local-convolution branch for fine-grained motion, (2) a gated memory accumulator for long-term dependencies, and (3) a periodic encoder for modeling cyclic patterns. These signals are softly fused, enabling the model to adapt to diverse temporal structures without overfitting. To support large-scale video-language pretraining, we curate TempFlex-2M, a high-quality synthetic video–text corpus generated in a single stage via GPT-4o with direct visual prompting. We instantiate TempFlex-VL using two different language backbones, Gemma3-4B and Qwen3-4B, demonstrating the generality of our design across architectures. Both variants achieve state-of-the-art or competitive results on a wide range of image and video benchmarks while markedly improving token efficiency. We will release all code, models, and data to spur future research in unified multimodal understanding.

Submission Length: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Weicheng_Kuo1

Submission Number: 5478

Loading