Improving Truthfulness in Multimodal RAG: A Dual-Adapter Vision Large Language Model Approach

Published: 20 Aug 2025, Last Modified: 01 Feb 2026
Venue: 2025 KDD Cup CRAG-MM Workshop
License: CC BY-NC 4.0
Keywords: RAG, VLM, Hallucination
Abstract: Trustworthy multimodal question answering requires systems that can fuse visual evidence, external knowledge, and dialog history while avoiding costly mistakes by abstaining when the answer is uncertain. The Meta Comprehensive RAG Multimodal (CRAG-MM) Challenge 2025 evaluates this setting across staged tasks combining wearable egocentric imagery (including Ray-Ban Meta smart glasses), image and web retrieval, and multi-turn interaction. We present the AcroYAMALEX system, built on Llama 3.2 Vision Instruct with a two-stage adapter architecture: (i) a Retrieval-Oriented LoRA adapter that first produces a concise provisional answer, which we repurpose as a high-precision text query for downstream search; and (ii) an Answer-Generation LoRA adapter trained with uncertainty-aware relabelling so the model outputs “I don’t know” instead of hallucinating under weak evidence. Retrieved web snippets are chunked and reranked with Qwen3-Reranker-0.6B to provide focused context before final answer generation. In Task 2 (multi-source RAG), our approach contributed to a 3rd-place final ranking. These results suggest that coupling abstention training with deliberate query construction and neural reranking improves factual reliability in multimodal RAG systems.
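The pipeline described in the abstract (chunk retrieved snippets, rerank them against a query, then answer or abstain under weak evidence) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real system uses LoRA-adapted Llama 3.2 Vision Instruct for query construction and answering, and Qwen3-Reranker-0.6B for scoring; here the reranker is an injected `score_fn` stand-in, and all function names and thresholds are hypothetical.

```python
from typing import Callable, List


def chunk_snippets(snippets: List[str], max_words: int = 50) -> List[str]:
    """Split retrieved web snippets into word-bounded chunks."""
    chunks = []
    for text in snippets:
        words = text.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks


def rerank(query: str, chunks: List[str],
           score_fn: Callable[[str, str], float], top_k: int = 3) -> List[str]:
    """Score (query, chunk) pairs -- a neural reranker such as
    Qwen3-Reranker-0.6B in the real system -- and keep the top_k
    most relevant chunks as focused context."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]


def answer_or_abstain(context: List[str], confidence: float,
                      threshold: float = 0.5) -> str:
    """Abstain with "I don't know" when evidence is weak, mirroring the
    uncertainty-aware relabelling objective (threshold is illustrative)."""
    if not context or confidence < threshold:
        return "I don't know"
    return f"Answer grounded in: {context[0]}"
```

In the full system the provisional answer from the first adapter serves as the `query` passed to retrieval and reranking, which is what makes the downstream search high-precision.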
Submission Number: 5