Quantifying the Gap between Understanding and Generation within Unified Multimodal Models
Abstract: Recent unified multimodal models have made remarkable progress on both understanding and generation tasks. However, whether these two capabilities are genuinely integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities. Each question in GapEval can be answered in both modalities, enabling a symmetric evaluation of a model's bidirectional inference and cross-modal consistency. Experimental results reveal a persistent performance gap between the two directions across a wide range of unified models with different architectures, suggesting that current models achieve only engineering-level unification rather than true cognitive-level integration. To further explore the underlying mechanism, we introduce Unified Knowledge and conduct an empirical study from the perspective of knowledge manipulation. Our findings indicate that understanding and generation often rely on separate knowledge representations: the knowledge embedded in unified models tends to co-exist across modalities rather than be integrated, highlighting a key limitation in current multimodal alignment.