Keywords: KDD Cup, Comprehensive RAG Benchmark, Multimodal Question Answering, Retrieval-Augmented Generation, Vision-Language Models, Chain-of-Thought Reasoning
TL;DR: We present a unified MM-RAG system that achieves faithful and efficient visual question answering via structured CoT training, retrieval-augmented inference, and self-verification.
Abstract: We propose a unified Retrieval-Augmented Generation (RAG) architecture that ensures factual consistency, coherent outputs, and computational efficiency for multimodal question answering. Our method was developed for the CRAG-MM Challenge at KDD Cup 2025 and is designed to generate reliable answers by integrating external knowledge when answering both visual and text-based queries.
Our system is built on Llama 3.2 11B Vision-Instruct, enhanced with a structured reasoning module and a self-verification mechanism. We constructed approximately 4,000 supervised training instances using a four-stage pipeline involving Chain-of-Thought (CoT) generation, GPT-based evaluation, and rewriting. For fine-tuning, we applied a lightweight combination of LoRA and DoRA. At inference time, the model generates a search query from the image and question, retrieves relevant external context, and produces output in a structured <Reasoning> + <Answer> format. To suppress hallucinations and improve trustworthiness, we incorporated LLM-based consistency checking and ambiguity detection. The system runs on vLLM and supports fast batch inference with up to 12 parallel samples under a 10-second latency limit.
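The abstract does not include an implementation, but the inference loop it describes (query generation, retrieval, structured generation with parallel sampling) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the checkpoint name, the prompt templates, the `search_api` helper, and the use of a text caption as a stand-in for the image are all assumptions, and real multimodal input handling in vLLM is omitted.

```python
# Minimal sketch of the retrieve-then-answer loop described above (not the
# authors' code). Checkpoint name, prompts, and `search_api` are assumptions.
import re
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-11B-Vision-Instruct")  # assumed checkpoint

def answer(question: str, image_caption: str, search_api) -> str:
    # 1) Derive a search query from the (caption-summarized) image and question.
    query = llm.generate(
        [f"Image: {image_caption}\nQuestion: {question}\nSearch query:"],
        SamplingParams(temperature=0.0, max_tokens=32),
    )[0].outputs[0].text.strip()

    # 2) Retrieve external context for the query (hypothetical retrieval helper).
    context = "\n".join(search_api(query)[:5])

    # 3) Sample up to 12 parallel candidates in the structured output format.
    candidates = llm.generate(
        [f"Context:\n{context}\n\nQuestion: {question}\n"
         "Respond as <Reasoning>...</Reasoning><Answer>...</Answer>"],
        SamplingParams(n=12, temperature=0.7, max_tokens=256),
    )[0].outputs

    # 4) Keep the first candidate whose <Answer> span parses; otherwise abstain.
    #    This is only a stand-in for the LLM-based consistency checking and
    #    ambiguity detection mentioned above, which the sketch does not reproduce.
    for cand in candidates:
        match = re.search(r"<Answer>(.*?)</Answer>", cand.text, re.DOTALL)
        if match:
            return match.group(1).strip()
    return "I don't know"
```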
For Tasks 2 and 3, we reused the same model and overall pipeline as in Task 1. While the training data is built with a similar four-stage process, Tasks 2 and 3 omit the reasoning component and use a lightweight evaluation focused solely on semantic correctness. We also leverage auxiliary context retrieved from the web to boost answer accuracy. Task 3 extends the pipeline to multi-turn dialogue by incorporating conversation history into the input, as sketched below.
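As a companion to the multi-turn extension just mentioned, here is a hedged sketch of how conversation history could be folded into the Task 3 input alongside web-retrieved context. The template, the three-turn window, and the function name are assumptions; the abstract does not specify them.

```python
# Hypothetical Task 3 prompt builder: folds prior turns plus retrieved web
# context into a single input. Template and truncation window are assumptions.
def build_multiturn_prompt(
    history: list[tuple[str, str]],  # (user question, assistant answer) pairs
    question: str,
    context: str,
) -> str:
    # Keep only the most recent turns to stay within the context budget.
    turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history[-3:])
    return (
        f"Context:\n{context}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"User: {question}\nAssistant:"
    )
```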
Our method achieved top-tier performance across all tasks, winning the Multi-hop (5.9%) and Reasoning (10.3%) categories, as well as the Special Question Category award. These results demonstrate the effectiveness, robustness, and scalability of our proposed architecture.
Submission Number: 15