Keywords: RAG, VLM, Hallucination
Abstract: Trustworthy multimodal question answering requires systems that can fuse visual evidence, external knowledge, and dialog history while avoiding costly mistakes by abstaining when the answer is uncertain. The Meta Comprehensive RAG Multimodal (CRAG-MM) Challenge 2025 evaluates this setting across staged tasks combining wearable egocentric imagery (including Ray-Ban Meta smart glasses), image and web retrieval, and multi-turn interaction. We present the AcroYAMALEX system, built on Llama 3.2 Vision Instruct with a two-stage adapter architecture: (i) a Retrieval-Oriented LoRA adapter that first produces a concise provisional answer, which we re-purpose as a high-precision text query for downstream search; and (ii) an Answer-Generation LoRA adapter trained with uncertainty-aware relabelling so the model outputs “I don’t know” instead of hallucinating under weak evidence. Retrieved web snippets are chunked and reranked with Qwen3-Reranker-0.6B to provide focused context before final answer generation. In Task 2 (multi-source RAG), our approach contributed to a 3rd-place final ranking. These results suggest that coupling abstention training with deliberate query construction and neural reranking improves factual reliability in multimodal RAG systems.
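The chunk-and-rerank step described in the abstract can be sketched as follows. This is a minimal illustration only: the actual system scores chunks with Qwen3-Reranker-0.6B, whereas here a hypothetical token-overlap `score` function stands in so the example is self-contained; chunk sizes and the `rerank` helper are likewise assumptions, not the authors' implementation.

```python
from typing import List

def chunk(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split a retrieved snippet into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def score(query: str, chunk_text: str) -> float:
    """Placeholder relevance score (token overlap). In the described
    system, a neural reranker (Qwen3-Reranker-0.6B) plays this role."""
    q = set(query.lower().split())
    c = set(chunk_text.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, snippets: List[str], top_k: int = 3) -> List[str]:
    """Chunk all snippets, then keep the top_k highest-scoring chunks
    as focused context for final answer generation."""
    chunks = [c for s in snippets for c in chunk(s)]
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
```

In the described pipeline, the query passed to `rerank` would be the concise provisional answer produced by the first-stage adapter, re-purposed as a high-precision search query.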
Submission Number: 5