STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Published: 17 Oct 2025, Last Modified: 21 Nov 2025
MATH-AI 2025 Poster
License: CC BY 4.0
Keywords: Vision-Language Models, Efficient Reasoning, Topology-Aware Alignment, Out-of-Distribution Generalization
Abstract: Vision-language models (VLMs) often rely on chain-of-thought (CoT) reasoning, which yields verbose and suboptimal outputs on complex tasks. We introduce STELAR-Vision, a topology-aware training framework that uses TopoAug to generate diverse reasoning structures (Chain, Tree, Graph). Combined with supervised fine-tuning, reinforcement learning, and Frugal Learning, it improves both accuracy and efficiency: boosting Qwen2VL by 9.7%, surpassing Qwen2VL-72B by 7.3%, and outperforming Phi-4 and LLaMA-3.2 on five OOD benchmarks by up to 28.4% and 13.2%, respectively. We have released the datasets; code will be made available.
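The three reasoning topologies named in the abstract (Chain, Tree, Graph) can be pictured as directed graphs over reasoning steps. The sketch below is purely illustrative and is not taken from the paper's released code: it assumes a simple adjacency-list encoding, and all function names and parameters are hypothetical.

```python
# Illustrative sketch (not the authors' implementation): the three
# reasoning topologies over n numbered reasoning steps, encoded as
# adjacency lists mapping each step to its successor steps.

def chain(n):
    """Chain: each step feeds the next (classic linear CoT)."""
    return {i: ([i + 1] if i < n - 1 else []) for i in range(n)}

def tree(n, branching=2):
    """Tree: each step may branch into alternative sub-steps."""
    return {
        i: [c for c in range(branching * i + 1, branching * i + 1 + branching) if c < n]
        for i in range(n)
    }

def graph(n, extra_edges=()):
    """Graph: a chain plus cross-links that merge parallel lines of reasoning."""
    g = chain(n)
    for u, v in extra_edges:
        g[u].append(v)
    return g
```

For example, `chain(3)` gives `0 → 1 → 2`, `tree(3)` gives one root with two alternative children, and `graph(3, [(0, 2)])` adds a shortcut edge to the chain; a TopoAug-style augmenter could sample among such structures when generating training traces.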
Submission Number: 168