Incorporating LLM Versus LLM Into Multimodal Chain-of-Thought for Fine-Grained Evidence Generation

Published: 2025, Last Modified: 23 Jan 2026. IEEE Access 2025. CC BY-SA 4.0
Abstract: Multimodal Chain-of-Thought (MCoT) has become an effective strategy for enhancing multimodal large language models (MLLMs) by breaking complex tasks down into sequential reasoning steps. Despite its interpretability benefits, MCoT often struggles with fine-grained semantic grounding, particularly when reasoning involves small objects, subtle attributes, or visually complex scenes, and these difficulties lead to inaccurate answers. Existing attempts to address these issues fall primarily into two categories: fine-tuning, which depends on large annotated datasets and costly parameter updates, and in-context learning (ICL), which achieves few-shot or zero-shot reasoning without modifying the model. Although ICL is flexible and adaptable, it is prone to semantic drift caused by unstable prompt quality. To overcome these limitations, this study presents an entity-level evidence generation and verification framework within the ICL paradigm. The approach first produces an MCoT from the multimodal inputs and then extracts key entities with enriched evidential descriptions. These entities are cross-validated through adversarial checks using multiple MLLMs, and the verified evidence is integrated back into the reasoning chain. Experiments show consistent performance gains: on ScienceQA, accuracy improves from 82.39% to 86.04% (+3.65 percentage points) with GPT-3.5 and from 84.96% to 89.37% (+4.41 percentage points) with Gemini; on MathVista, accuracy increases from 43.1% to 43.6% (+0.5 percentage points) with GPT-3.5 and from 44.7% to 45.6% (+0.9 percentage points) with Gemini. These results establish new state-of-the-art baselines and confirm the robustness and generalizability of entity-level verification for multimodal reasoning.
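The verification and integration steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify prompts, an API, or how the adversarial checks are aggregated, so the sketch substitutes a simple agreement threshold over yes/no answers. The names `Model`, `Evidence`, `verify_evidence`, `integrate`, and `min_agree` are hypothetical, and the verifiers are toy functions standing in for MLLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for an MLLM call: prompt text in, response text out.
Model = Callable[[str], str]


@dataclass
class Evidence:
    entity: str        # key entity extracted from the reasoning chain
    description: str   # evidential description attached to that entity


def verify_evidence(evidence: List[Evidence], verifiers: List[Model],
                    min_agree: int) -> List[Evidence]:
    """Keep an item only if at least min_agree verifiers answer 'yes'.

    The agreement threshold is an assumed aggregation rule, not the paper's.
    """
    kept = []
    for ev in evidence:
        prompt = (f"Is this description of '{ev.entity}' accurate? "
                  f"{ev.description} Answer yes or no.")
        votes = sum(1 for ask in verifiers
                    if ask(prompt).strip().lower().startswith("yes"))
        if votes >= min_agree:
            kept.append(ev)
    return kept


def integrate(chain: str, verified: List[Evidence]) -> str:
    """Append verified evidence to the reasoning chain as plain text."""
    if not verified:
        return chain
    lines = [chain, "Verified evidence:"]
    lines.extend(f"- {ev.entity}: {ev.description}" for ev in verified)
    return "\n".join(lines)


# Toy verifiers so the sketch runs without any model access.
def strict(prompt: str) -> str:
    return "yes" if " red " in prompt else "no"


def lenient(prompt: str) -> str:
    return "yes"


candidates = [
    Evidence("apple", "a small red fruit on the left plate"),
    Evidence("cup", "a blue cup behind the apple"),
]
verified = verify_evidence(candidates, [strict, lenient], min_agree=2)
refined = integrate("Step 1: locate the fruit.", verified)
```

In this toy run, an evidence item survives only when both verifiers accept it; in practice the callables would wrap real model queries, and the aggregation rule would follow whatever the paper actually specifies.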