ChainGeo: Enabling Effective Geometric Reasoning in Small VLMs through Interleaved Visual-Text Chains
Keywords: VLM, Interleaved Reasoning, Small-Scale Models, Visual Grounding
Abstract: Solving geometric problems requires linking visual perception with symbolic reasoning. However, small Vision-Language Models (VLMs) often fail to maintain this connection. We introduce ChainGeo, a novel framework that enables small VLMs to perform complex geometric reasoning through interleaved visual-text chains. Our approach represents geometric elements as specialized tokens (e.g., [Point A], [Line AB]) that maintain explicit grounding in diagram regions and act as bridges between visual features and symbolic reasoning. We further propose step-level consistency distillation to transfer complete reasoning processes from large teacher models, enforcing visual-textual coherence at each step. Experiments on GeoQA+ (72.1%), Geometry3K (64.7%), and We-Math (68.2%) show that our 2.7B model achieves performance comparable to GPT-4V while providing interpretable, grounded reasoning chains. In human evaluations, our model grounded visual references more accurately (75.3%) and reduced hallucinations by 36.6% compared with text-only baselines.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10450