InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye; Qiyuan He; Jiaqi Han; Puheng Li; Jiaojiao Fan; Zekun Hao; Fitsum Reda; Yogesh Balaji; Huayu Chen; Sheng Liu; Angela Yao; James Zou; Stefano Ermon; Haoxiang Wang; Ming-Yu Liu

InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu

Published: 26 Jan 2026, Last Modified: 01 Mar 2026ICLR 2026 OralEveryoneRevisionsBibTeXCC BY 4.0

Keywords: discrete tokenization, video representation, eficiency, information theory

TL;DR: This paper introduces InfoTok, an adaptive video tokenizer guided by information theory, which significantly boosts video compression efficiency and reduces computational overhead without degrading visual quality.

Abstract: Accurate and efficient discrete video tokenization is essential for long video sequences processing. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces \alg, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving $20\%$ tokens without influence on performance, and achieving $2.3\times$ compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, \alg enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.

Supplementary Material: zip

Primary Area: generative models

Submission Number: 4593

Loading