Keywords: multimodal learning, reasoning, tool use
Abstract: While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require heterogeneous vision capabilities. Unfortunately, we have yet to develop methods that infuse fine-grained recognition, visual grounding, depth estimation, and 3D reasoning into a single vision-language model. Instead of forcing smaller models to learn both perception and reasoning, we propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we create and filter a large dataset of 273K high-quality synthetic reasoning traces over the perceptual outputs of vision specialists. Trained on this data, LATTE achieves significant gains across 6 benchmarks covering both perception and reasoning abilities, compared to baselines instruction-tuned with direct answers. In contrast, models trained by distilling both perception and reasoning from larger models yield smaller gains or even degrade on some perception tasks. Further, our method results in a $2\%$ to $5\%$ improvement on average across all benchmarks over the vanilla instruction-tuned baseline regardless of the model backbone, with gains of up to $16\%$ on MMVet.
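To make the "thinking with vision specialists" setup concrete, below is a minimal, hypothetical Python sketch of what a reasoning trace over specialist outputs might look like. The specialist names, field layout, and example values are illustrative assumptions, not the paper's actual data schema or tool set.

```python
# Hypothetical sketch of a "think with vision specialists" training example.
# Tool names, fields, and values are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SpecialistCall:
    """One perception step: which vision specialist was called and what it returned."""
    tool: str                  # e.g. "grounding", "depth", "ocr" (assumed names)
    arguments: dict[str, Any]  # tool-specific inputs (query, regions, ...)
    output: Any                # perceptual output fed back to the language model


@dataclass
class ReasoningTrace:
    """A synthetic trace: question -> specialist outputs -> textual reasoning -> answer."""
    question: str
    calls: list[SpecialistCall] = field(default_factory=list)
    reasoning: str = ""
    answer: str = ""


# Toy example in the spirit of the abstract: perception is offloaded to
# specialists, and the model only reasons over their outputs.
trace = ReasoningTrace(
    question="Which object is closer to the camera, the mug or the laptop?",
    calls=[
        SpecialistCall(
            tool="grounding",
            arguments={"query": "mug; laptop"},
            output={"mug": [120, 340, 210, 420], "laptop": [400, 300, 720, 520]},
        ),
        SpecialistCall(
            tool="depth",
            arguments={"regions": ["mug", "laptop"]},
            output={"mug": 0.8, "laptop": 1.6},  # illustrative relative depths
        ),
    ],
    reasoning="The depth specialist places the mug at ~0.8 and the laptop at "
              "~1.6, so the mug is nearer to the camera.",
    answer="the mug",
)

if __name__ == "__main__":
    for call in trace.calls:
        print(f"[{call.tool}] {call.arguments} -> {call.output}")
    print("Reasoning:", trace.reasoning)
    print("Answer:", trace.answer)
```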
Supplementary Material: pdf
Submission Number: 36