Keywords: Vision-Language Models, Efficient Reasoning, Topology-Aware Alignment, Out-of-Distribution Generalization
Abstract: Vision-language models (VLMs) often rely on chain-of-thought (CoT) reasoning, which produces verbose and suboptimal outputs on complex tasks. We introduce STELAR-Vision, a topology-aware training framework that uses TopoAug to generate diverse reasoning structures (Chain, Tree, Graph). Combined with supervised fine-tuning, reinforcement learning, and Frugal Learning, it improves both accuracy and efficiency: boosting Qwen2VL by 9.7%, surpassing Qwen2VL-72B by 7.3%, and outperforming Phi-4 and LLaMA-3.2 on five out-of-distribution (OOD) benchmarks by up to 28.4% and 13.2%, respectively. Our datasets have been released, and code will be made available.
Submission Number: 168