Keywords: multimodal large language model, visual tokenizer
Abstract: The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level visual appearance, making it well-suited for visual generation but lacking the high-level semantic representations needed for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. Directly integrating reconstruction and semantic objectives, however, creates conflicts that degrade both reconstruction quality and semantic representation. Instead of forcing a single codebook to capture both visual appearance and semantics, DualToken disentangles them by introducing separate codebooks for high-level semantics and low-level visual details, turning their inherent conflict into a synergistic relationship. As a result, DualToken sets a new record of 0.25 rFID and 82.0% zero-shot accuracy on ImageNet, and proves effective in downstream MLLM tasks for both understanding and generation. Specifically, our method outperforms VILA-U by 5.8% on average across ten visual understanding benchmarks and achieves a 10% improvement on GenAI-Bench. Notably, incorporating dual visual tokens consistently outperforms using a single token type in both understanding and generation tasks. We hope our research offers a new perspective on leveraging dual visual vocabularies for unified vision-language understanding and generation models.
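The sketch below illustrates the dual-codebook idea described in the abstract: two separate codebooks, one for high-level semantic features and one for low-level detail features, each quantized independently. It is a minimal illustration written in PyTorch; the class and parameter names (`DualCodebookQuantizer`, `semantic_dim`, `pixel_dim`) are hypothetical and this is not the authors' implementation.

```python
# Minimal sketch of a dual-codebook quantizer (assumed PyTorch),
# illustrating "separate codebooks for semantics and visual details".
# All names here are hypothetical, not the paper's actual code.
import torch
import torch.nn as nn


class DualCodebookQuantizer(nn.Module):
    def __init__(self, codebook_size: int, semantic_dim: int, pixel_dim: int):
        super().__init__()
        # One codebook for high-level semantic features,
        # one for low-level visual-detail features.
        self.semantic_codebook = nn.Embedding(codebook_size, semantic_dim)
        self.pixel_codebook = nn.Embedding(codebook_size, pixel_dim)

    @staticmethod
    def _quantize(features: torch.Tensor, codebook: nn.Embedding):
        # features: (N, D); nearest-neighbor lookup in the codebook.
        distances = torch.cdist(features, codebook.weight)  # (N, K)
        indices = distances.argmin(dim=-1)                  # (N,)
        quantized = codebook(indices)                       # (N, D)
        # Straight-through estimator so gradients reach the encoder features.
        quantized = features + (quantized - features).detach()
        return quantized, indices

    def forward(self, semantic_feats: torch.Tensor, pixel_feats: torch.Tensor):
        # Quantize the two feature streams independently, so the semantic
        # and reconstruction objectives do not compete for one codebook.
        sem_q, sem_idx = self._quantize(semantic_feats, self.semantic_codebook)
        pix_q, pix_idx = self._quantize(pixel_feats, self.pixel_codebook)
        return (sem_q, sem_idx), (pix_q, pix_idx)


# Usage with random features standing in for encoder outputs.
quantizer = DualCodebookQuantizer(codebook_size=8192, semantic_dim=768, pixel_dim=256)
sem_tokens, pix_tokens = quantizer(torch.randn(16, 768), torch.randn(16, 256))
```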
Supplementary Material: zip
Primary Area: generative models
Submission Number: 16577