Unified vision large language models (VLLMs) have shown remarkable progress in both multimodal understanding and generation, enabling tasks such as visual question answering and image generation. However, existing datasets often fall short of fully leveraging the synergistic potential between these two capabilities, thereby limiting the performance of unified VLLMs. To address this gap, we propose a novel dataset construction framework, \textbf{UnifiedVisual}, and introduce \textbf{UnifiedVisualData}, a high-quality dataset designed to enhance the mutual reinforcement between multimodal understanding and generation. UnifiedVisualData integrates both visual and textual inputs and outputs, fostering holistic multimodal reasoning and precise text-guided image generation. Moreover, the dataset spans a diverse range of tasks and data sources, effectively addressing key limitations of existing datasets. To validate the effectiveness of UnifiedVisualData, we train a unified VLLM, Anole-UnifiedVisual, which consistently outperforms models trained on existing datasets across a wide range of tasks. Notably, our model exhibits significant mutual enhancement between multimodal understanding and generation, underscoring the advantages of our framework. We believe UnifiedVisual offers a promising direction for advancing unified VLLMs and unlocking their full potential.