Keywords: Unified Models for Understanding and Generation
Abstract: Large-scale multimodal models have achieved remarkable progress in both understanding and generation. Traditionally, these tasks have been studied in isolation, resulting in separate architectures. Recent efforts instead pursue unified multimodal models that combine heterogeneous components to support both capabilities within a single framework. However, such models introduce substantial challenges in architectural redundancy, compute allocation, and efficient scaling.
In this work, we conduct a systematic analysis of unified multimodal model components, using training-free pruning as a probing methodology that considers both depth pruning and width reduction. Our study reveals that the understanding component, although essential for multimodal reasoning, is notably compressible in generation tasks. In contrast, the generation components are highly sensitive to compression: performance degrades sharply even under moderate depth- or width-reduction ratios. To address this limitation, we propose a Mixture-of-Experts (MoE) adaptation, inspired by the dynamic activation patterns observed in hidden neurons. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We first demonstrate the potential of sparse activation in generation components, then show that a fully trainable adaptation further improves performance. As a result, the adapted BAGEL model matches the full model's performance while activating only about half of its parameters.
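To make the probing methodology concrete, here is a minimal sketch of a training-free depth-pruning probe. It assumes a PyTorch-style model that exposes its transformer blocks as an nn.ModuleList and a user-supplied evaluate callback; these names are illustrative, not the paper's implementation.

import copy
import torch.nn as nn

def depth_prune(model: nn.Module, blocks_attr: str, drop_ratio: float) -> nn.Module:
    """Return a copy of `model` with a fraction of its transformer blocks removed."""
    pruned = copy.deepcopy(model)
    blocks = getattr(pruned, blocks_attr)  # e.g. an nn.ModuleList of layers
    n_drop = int(len(blocks) * drop_ratio)
    # One simple selection rule: drop the deepest n_drop blocks. Importance-based
    # selection is an equally valid probe; the point is that no retraining occurs.
    setattr(pruned, blocks_attr, nn.ModuleList(list(blocks)[: len(blocks) - n_drop]))
    return pruned

# Probing loop: sweep pruning ratios and record task metrics without fine-tuning.
# for r in (0.1, 0.25, 0.5):
#     score = evaluate(depth_prune(model, "blocks", r), benchmark)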
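The MoE adaptation can likewise be sketched as partitioning one feed-forward block of the generation module into experts with sparse top-k routing. The expert count, top-k value, and linear router below are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """One FFN whose hidden neurons are split evenly across sparsely activated experts."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        d_expert = d_hidden // n_experts  # partition the original hidden width
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # token-wise gating scores
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e).any(dim=-1)              # tokens routed to expert e
            if mask.any():
                w = weights[mask][idx[mask] == e].unsqueeze(-1)
                out[mask] += w * expert(x[mask])
        return out

With n_experts = 4 and top_k = 2, each token activates roughly half of the FFN parameters, mirroring the roughly-half activation budget reported for the adapted BAGEL model.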
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22493