Abstract: The remarkable performance of Multimodal Large Language Models (MLLMs) demonstrates their strong understanding capabilities across a wide range of visual tasks. Nevertheless, their black-box reasoning processes remain opaque, leaving them uninterpretable and prone to hallucination, and their ability to perform intricate compositional reasoning is limited, hindering further progress. In this work, we introduce Fact, a novel paradigm for generating multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. The paradigm uses verifiable visual programming to generate executable code, which guarantees faithfulness. Through a series of operations, including pruning, merging, and bridging, the rationales are made concise. Furthermore, we filter for rationales that transfer from the programming paradigm to end-to-end paradigms, ensuring transferability. Experimental results demonstrate the superiority of Fact across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability while reducing hallucinations, owing to the strong alignment between images and text in the rationales.