Keywords: Adaptive video tokenizer, Autoregressive video generation, Probabilistic taildrop, Reinforcement learning
TL;DR: To train an adaptive video tokenizer, we introduce probabilistic taildrop, which injects a visual-complexity prior into the tokenizer, and apply GRPO for post-training, which further boosts efficiency in a task-aware, adaptive manner.
Abstract: Recent advances in visual tokenizers have demonstrated their effectiveness for multimodal large language models and autoregressive generative models. However, most existing visual tokenizers rely on a fixed downsampling rate at a given visual resolution and consequently produce a constant number of visual tokens, ignoring the fact that visual content of varying complexity warrants different token budgets. Motivated by this observation, we propose an adaptive video tokenizer, "VaporTok", with two core contributions:
(1) Probabilistic Taildrop: We introduce a novel taildrop mechanism that learns a truncation-index sampling distribution conditioned on the visual complexity of the video. During both training and inference, the decoder reconstructs videos at adaptive token lengths, allocating more tokens to complex videos and fewer to simpler ones.
(2) Parallel Sample GRPO with Vapor Reward: By leveraging the probability distribution produced by probabilistic taildrop, we reformulate the visual tokenization pipeline as a sequential decision process. To optimize this process, we propose a variant of GRPO together with a composite reward encompassing token efficiency, reconstruction fidelity, and generative quality, enabling metric-aware adaptive tokenization across diverse objectives.
Extensive experiments on standard video generation benchmarks confirm our analysis, showing that our adaptive approach matches or outperforms fixed-rate baselines and naive taildrop while using fewer tokens.
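To make the two contributions concrete, here is a minimal, hypothetical PyTorch sketch of how a complexity-conditioned truncation distribution and a composite reward could be wired together. Everything here is an assumption for illustration: the class name ProbabilisticTaildrop, the linear conditioning head, the function vapor_reward, and the linear weighting of the three reward terms are not taken from the paper, which only names the mechanism and the reward components.

```python
import torch
import torch.nn as nn


class ProbabilisticTaildrop(nn.Module):
    """Sketch: predict a categorical distribution over truncation indices
    from a video-level complexity feature, sample an index k, and keep
    only the first k latent tokens (dropping the tail)."""

    def __init__(self, feat_dim: int, max_tokens: int):
        super().__init__()
        self.max_tokens = max_tokens
        # Hypothetical conditioning head; the abstract does not specify
        # the architecture of the complexity-conditioned sampler.
        self.head = nn.Linear(feat_dim, max_tokens)

    def forward(self, tokens: torch.Tensor, complexity_feat: torch.Tensor):
        # tokens: (B, N, D) latent tokens with N == max_tokens;
        # complexity_feat: (B, feat_dim) video-level complexity feature.
        dist = torch.distributions.Categorical(logits=self.head(complexity_feat))
        k = dist.sample() + 1  # truncation index in [1, N], one per video
        keep = torch.arange(self.max_tokens, device=tokens.device)
        mask = (keep[None, :] < k[:, None]).unsqueeze(-1).to(tokens.dtype)
        # The log-prob of the sampled index is what a GRPO-style
        # policy-gradient update would differentiate through.
        return tokens * mask, k, dist.log_prob(k - 1)


def vapor_reward(rec_quality, gen_quality, k, max_tokens,
                 w_rec=1.0, w_gen=1.0, w_eff=0.1):
    """Composite reward over the three terms the abstract names
    (reconstruction fidelity, generative quality, token efficiency);
    the weights and the linear combination are assumptions."""
    token_efficiency = 1.0 - k.float() / max_tokens
    return w_rec * rec_quality + w_gen * gen_quality + w_eff * token_efficiency
```

Under this reading, a GRPO-style post-training step would sample several truncation indices per video in parallel, score each with vapor_reward, and use group-relative advantages weighted by the returned log-probabilities to update the taildrop distribution.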
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 12546