Abstract: In this paper, we introduce the Multi-Modal Bilingual Instruction Tuning dataset (M2BIT), designed to enhance the performance of vision-language models (VLMs). M2BIT is one of the largest multi-modal instruction tuning datasets available, covering 40 diverse vision-language tasks in both English and Chinese. It comprises 2 million instances together with 400 manually written task instructions. Through a carefully curated annotation process, we aim to raise response quality, enriching the user experience while minimizing potential hallucinations. To validate the efficacy of M2BIT, we train a VLM, Ying-VLM, on this dataset and examine the impact of instruction tuning across languages and modalities. Compared with strong VLM baselines, Ying-VLM achieves superior performance on knowledge-intensive visual question answering tasks, exhibits a lower propensity for hallucination, generalizes better to previously unseen video tasks, and better comprehends novel instructions in Chinese. We will open-source the M2BIT dataset and the trained models to facilitate future research.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Data resources
Languages Studied: English, Chinese
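To make the abstract's description of "instances accompanied by task instructions" more concrete, the sketch below illustrates what a single multi-modal instruction-tuning record could look like. All field names and values here are hypothetical assumptions for illustration only, not the actual M2BIT schema.

```python
# A minimal sketch of one multi-modal instruction-tuning instance.
# Field names (task, language, instruction, inputs, image_base64, outputs)
# are assumptions, not the released M2BIT format.
import json

example_instance = {
    "task": "visual_question_answering",     # one of the 40 vision-language tasks
    "language": "en",                         # "en" or "zh" in a bilingual dataset
    "instruction": "Answer the question based on the image.",  # a manually written task instruction
    "inputs": "Question: What is the person in the image holding?",
    "image_base64": "<base64-encoded image bytes>",  # placeholder, not real data
    "outputs": "The person is holding a red umbrella.",
}

# Serialize as a single JSON line, a common storage layout for instruction datasets.
print(json.dumps(example_instance, ensure_ascii=False))
```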