Abstract: The remarkable performance of Multimodal Large Language Models (MLLMs) has demonstrated their strong understanding capabilities across a wide array of visual tasks. Nevertheless, their black-box reasoning processes remain opaque, leaving them uninterpretable and prone to hallucination. Their ability to perform intricate compositional reasoning tasks is also limited, which stalls further learning progress for these models. In this work, we introduce Fact, a novel paradigm for generating multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. This paradigm utilizes verifiable visual programming to generate executable code, guaranteeing faithfulness and precision.
Subsequently, we improve the rationale's conciseness through a series of operations, including pruning, merging, and bridging.
Furthermore, we filter for rationales that can be transferred from the programming paradigm to end-to-end paradigms, guaranteeing transferability. Empirical evidence from experiments demonstrates the superiority of our method across models of varying parameter sizes, significantly enhancing their compositional reasoning and generalization ability. Our approach also reduces hallucination owing to the high correlation it enforces between images and text. The anonymous project is available at: https://anonymous.4open.science/r/Fact_program-216D/
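To make the pipeline described above concrete, the following minimal Python sketch illustrates the kind of post-processing the abstract outlines: an executed visual-program trace is pruned of steps whose results are never used, adjacent steps about the same object are merged, and the surviving facts are bridged into a single concise rationale. The step names, toy trace, and helper functions here are hypothetical illustrations under assumed data structures, not the paper's actual primitives or implementation.

```python
# Illustrative sketch only: all operation names and the toy trace are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    op: str      # e.g. "find", "attribute", "compare" (assumed primitive names)
    arg: str     # object or quantity the operation acts on
    result: str  # value produced when the program ran on the image
    used: bool   # whether later steps or the final answer depend on this result

def prune(trace: List[Step]) -> List[Step]:
    """Drop steps whose results never contribute to the final answer."""
    return [s for s in trace if s.used]

def merge(trace: List[Step]) -> List[str]:
    """Merge adjacent steps about the same object into single facts."""
    facts, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i].arg == trace[i + 1].arg:
            facts.append(f"{trace[i].arg}: {trace[i].result}; {trace[i + 1].result}")
            i += 2
        else:
            facts.append(f"{trace[i].arg}: {trace[i].result}")
            i += 1
    return facts

def bridge(facts: List[str], answer: str) -> str:
    """Connect the remaining facts into one concise rationale ending in the answer."""
    return " -> ".join(facts) + f" -> therefore, {answer}"

# Toy execution trace of a visual program for "Is the mug left of the laptop?"
trace = [
    Step("find", "mug", "bounding box at x=120", True),
    Step("attribute", "mug", "on the table", True),
    Step("find", "laptop", "bounding box at x=480", True),
    Step("find", "keyboard", "bounding box at x=300", False),  # irrelevant; pruned
    Step("compare", "positions", "x_mug < x_laptop", True),
]

rationale = bridge(merge(prune(trace)), "yes, the mug is left of the laptop")
print(rationale)
```

In a real system the trace would come from executing model-generated code against the image; the sketch only shows how pruning, merging, and bridging could compress such a trace into a concise textual rationale.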
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: This work introduces Fact, a paradigm that enhances Multimodal Large Language Models (MLLMs) in multimodal processing by providing interpretable, concise, and transferable multimodal rationales. It leverages verifiable visual programming to create executable code, ensuring the rationales' faithfulness and precision. Conciseness is further improved through operations such as pruning, merging, and bridging, making the model's reasoning processes more accessible. Additionally, the rationales' transferability to end-to-end distillation promotes their application across different models and tasks. Empirical validation across various model sizes demonstrates the approach's effectiveness, significantly boosting MLLMs' interpretative capabilities and setting new performance benchmarks in multimodal processing.
Supplementary Material: zip
Submission Number: 4609