Unified vision large language models (VLLMs) have shown remarkable progress in both multimodal understanding and generation, enabling tasks such as visual question answering and image generation. However, existing datasets often fall short of fully leveraging the synergistic potential between these two capabilities, thereby limiting the performance of unified VLLMs. To address this gap, we propose a novel dataset construction framework, \textbf{UnifiedVisual}, and introduce \textbf{UnifiedVisualData}, a high-quality dataset designed to enhance the mutual reinforcement between multimodal understanding and generation. UnifiedVisualData integrates both visual and textual inputs and outputs, fostering holistic multimodal reasoning and precise text-guided image generation. Moreover, the dataset spans a diverse range of tasks and data sources, effectively addressing key limitations of existing datasets. To validate the effectiveness of UnifiedVisualData, we train a unified VLLM, Anole-UnifiedVisual, which consistently outperforms models trained on existing datasets across a wide range of tasks. Notably, our model exhibits significant mutual enhancement between multimodal understanding and generation, underscoring the advantages of our framework. We believe UnifiedVisual offers a promising direction for advancing unified VLLMs and unlocking their full potential.