SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
Abstract: This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTok), a novel video tokenizer that overcomes the limitations of current video tokenization methods for compact yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework that compresses visual inputs through distinct spatial and temporal queries via a Decoupled Query AutoEncoder (DQAE). This design allows SweetTok to efficiently reduce the video token count while achieving superior fidelity by capturing essential information across the spatial and temporal dimensions. Furthermore, we design a Motion-enhanced Language Codebook (MLC) tailored to spatial and temporal compression, addressing the differences in semantic representation between appearance and motion information. SweetTok significantly improves video reconstruction, by 42.8% w.r.t. rFVD on the UCF-101 dataset. With a better token compression strategy, it also boosts downstream video generation by 15.1% w.r.t. gFVD. Additionally, the compressed decoupled tokens are imbued with semantic information, enabling few-shot recognition powered by LLMs in downstream applications.
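To make the decoupled compression idea concrete, here is a minimal NumPy sketch of how learnable spatial queries could compress each frame's patches while separate temporal queries pool across frames. All names, shapes, and query counts (`Qs`, `Qt`) are illustrative assumptions, not the paper's actual DQAE implementation, which uses trained attention layers and a quantized codebook.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    # Single-head cross-attention: queries (Q, D) attend over keys_values (N, D).
    attn = softmax(queries @ keys_values.T / np.sqrt(queries.shape[-1]))
    return attn @ keys_values  # (Q, D)

rng = np.random.default_rng(0)
T, P, D = 8, 196, 64   # frames, patches per frame, channel dim (hypothetical)
Qs, Qt = 16, 4         # spatial / temporal query counts (hypothetical)

video = rng.standard_normal((T, P, D))        # patch features per frame
spatial_q = rng.standard_normal((Qs, D))      # learnable spatial (appearance) queries
temporal_q = rng.standard_normal((Qt, D))     # learnable temporal (motion) queries

# Spatial compression: each frame's P patches -> Qs appearance tokens.
spatial_tokens = np.stack([cross_attend(spatial_q, video[t]) for t in range(T)])
# spatial_tokens: (T, Qs, D)

# Temporal compression: for each spatial slot, pool across T frames -> Qt motion tokens.
temporal_tokens = np.stack(
    [cross_attend(temporal_q, spatial_tokens[:, s]) for s in range(Qs)]
)
# temporal_tokens: (Qs, Qt, D) -- Qs * Qt tokens, far fewer than the T * P patches

print(spatial_tokens.shape, temporal_tokens.shape)
```

The point of the decoupling is visible in the shapes: a flattened tokenizer would emit `T * P = 1568` tokens here, while the decoupled queries yield `Qs * Qt = 64` motion-aware tokens, with appearance and motion captured by separate query sets.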