Keywords: Video Tokenization, Video Compression
TL;DR: TivTok splits videos into time-invariant tokens and time-variant tokens, achieving $2.91\times$ compression efficiency via broadcast reuse while maintaining superior quality.
Abstract: Video tokenization is a critical bottleneck for learned video compression and generation. Existing methods often fail to adapt to the uneven information density of videos, underutilize temporal redundancy, and overlook the reusability of shared content. We present \textbf{TivTok} (\emph{Time-Invariant Tokenizer}), a transformer-based tokenizer that explicitly decouples videos into \textbf{time-invariant (TIV) tokens}, which capture global information shared across frames, and \textbf{time-variant (TV) tokens}, which encode frame-specific residual details. The encoder enforces this factorization through tailored attention masking, enabling the invariant component to capture not only static elements but also temporally coherent patterns such as consistent motion trajectories. During decoding, a broadcast mechanism reuses TIV tokens across frames, reducing complexity from quadratic to linear in video length. We further extend this approach to long videos through cross-chunk reuse, enabling scalable compression. Experiments show that TivTok improves reconstruction quality, achieving an FVD of 12.65 in the standard $16 \times 256 \times 256$ setting, and delivers a $2.91\times$ gain in compression efficiency on $128 \times 256 \times 256$ videos compared with state-of-the-art methods.
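The token-budget saving behind the broadcast-reuse claim can be sketched numerically. The snippet below is a minimal illustration of the idea described in the abstract, not the paper's actual architecture: all names, shapes, and token counts are illustrative assumptions.

```python
import numpy as np

# Illustrative assumptions (not from the paper): token counts and dims.
T = 16        # frames in a clip
N_TIV = 8     # time-invariant tokens, shared by the whole clip
N_TV = 4      # time-variant (residual) tokens per frame
D = 32        # token embedding dimension

rng = np.random.default_rng(0)
tiv = rng.standard_normal((N_TIV, D))       # one shared set per clip
tv = rng.standard_normal((T, N_TV, D))      # frame-specific residuals

# Broadcast reuse: every frame's decoder input concatenates the SAME
# TIV tokens with that frame's own TV tokens, so the shared tokens are
# stored once rather than once per frame.
frames_in = np.concatenate(
    [np.broadcast_to(tiv, (T, N_TIV, D)), tv], axis=1
)  # shape: (T, N_TIV + N_TV, D)

# Stored-token budget: shared-TIV scheme vs. naive per-frame tokens.
tokens_shared = N_TIV + T * N_TV      # 8 + 16*4  = 72
tokens_naive = T * (N_TIV + N_TV)     # 16*(8+4) = 192
ratio = tokens_naive / tokens_shared  # ~2.7x fewer tokens to store
```

With these toy counts the shared scheme stores 72 tokens instead of 192, and the per-frame decoder input grows linearly in `T`, matching the abstract's quadratic-to-linear argument in spirit.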
Primary Area: generative models
Submission Number: 2988