Abstract: Vision-language models (VLMs) are constrained by limited high-quality fine-tuning data for question-answering. We introduce VisCon-100K, a dataset of 100K image conversations derived from 45K web documents. Using GPT-4V to generate image-contextual captions and OpenChat 3.5 to create diverse question-answer pairs, we enhance VLM performance across multiple benchmarks. Our approach leverages the web context accompanying each image and outperforms methods that focus solely on fine-grained visual content. We also find that a “leaky modality mix,” where questions are answerable from both the image and its contextual caption, yields superior results. VisCon-100K shows strong performance with two VLM approaches: a text-only LLM aligned with a vision encoder (ShareGPT4V 7b) and a multimodally pretrained LLM (IDEFICS2 8b). We release both the VisCon-100K dataset (https://huggingface.co/datasets/tiiuae/viscon-100k) and a contextual captioner (https://huggingface.co/tiiuae/viscon-contextual-captioner) trained on it to facilitate scalable fine-tuning data generation for future research and open-source applications. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset (https://huggingface.co/datasets/tiiuae/viscon-1m).