Keywords: Autoregressive video generation, discrete tokens, VQ-VAE, entropy, top-p / nucleus sampling, top-k, adaptive decoding, error accumulation, uncertainty-aware sampling
TL;DR: We propose ENkG, an entropy-guided sampling policy with a k-guard that mitigates error accumulation and preserves structure in long-horizon autoregressive video generation—plug-and-play at inference, no retraining.
Abstract: Autoregressive (AR) architectures have achieved significant success in large language models (LLMs), inspiring their exploration for video generation. In LLMs, top-$p$/top-$k$ sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed candidate-set size already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-$k$/top-$p$ strategies ineffective for video decoders: they either inject unnecessary randomness into low-uncertainty regions (static backgrounds) or lock in early errors in high-uncertainty regions (foreground objects). Prediction errors accumulate as more frames are generated and eventually severely degrade long-horizon quality.
To address this, we propose Entropy-Guided $k$-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token’s predicted distribution.
ENkG uses adaptive token candidate sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; for high-entropy regions, it uses more candidates to mitigate error compounding.
ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
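The abstract's core idea — shrink the candidate set where the predicted distribution is sharp and widen it where the distribution is flat — can be sketched as follows. The linear entropy-to-$k$ schedule, the function name, and the `k_min`/`k_max` bounds are illustrative assumptions; the abstract only states that low-entropy tokens get fewer candidates and high-entropy tokens get more, not the exact mapping.

```python
import numpy as np

def entropy_guided_topk_sample(logits, k_min=1, k_max=50, rng=None):
    """Sketch of entropy-guided adaptive top-k sampling (ENkG-style).

    Hypothetical linear schedule between k_min and k_max; the paper's
    exact entropy-to-k rule is not specified in the abstract.
    """
    rng = rng or np.random.default_rng()
    # Numerically stable softmax over the token logits.
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    # Shannon entropy of the predicted token distribution.
    ent = -np.sum(probs * np.log(probs + 1e-12))
    max_ent = np.log(len(probs))  # entropy of the uniform distribution
    # Assumed mapping: candidate count grows linearly with normalized entropy.
    k = int(round(k_min + (ent / max_ent) * (k_max - k_min)))
    k = max(k_min, min(k, k_max))
    # Keep the k most probable tokens, renormalize, and sample among them.
    top = np.argsort(probs)[-k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p)), k
```

With a sharply peaked distribution (a static-background token) this collapses toward greedy decoding (`k` near `k_min`), while a near-uniform distribution (an uncertain foreground token) keeps many candidates (`k` near `k_max`), matching the behaviour the abstract describes.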
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11846