Keywords: VLMs, MM-RAG QA system
Abstract: Vision-Language Models (VLMs) show remarkable capabilities in multimodal reasoning but struggle with hallucinations, long-tail recognition, and grounding under real-world conditions such as those found in wearable devices. The Meta CRAG-MM Challenge 2025 presents a benchmark to evaluate these issues across three tasks: Single-source Augmentation, Multi-source Augmentation, and Multi-turn QA. We present a modular MM-RAG pipeline that explicitly targets factuality, interpretability, and robustness in such scenarios. Our system combines object-aware image cropping, domain-specific identification via a LoRA-finetuned LLaMA 3.2 Vision-Instruct model [4], CLIP- and BGE-based dual retrieval pipelines, and structured Chain-of-Thought (CoT) reasoning followed by a hallucination-sensitive summarizer. We observe that visual preprocessing and identification significantly improve retrieval quality, while CoT prompting enhances answer consistency. Early results show reduced hallucination and improved truthfulness over a baseline LLM setup. We outline remaining challenges and future directions, including contrastive answer alignment, domain generalization, and inference optimization, toward developing real-time, grounded VLMs for egocentric vision tasks.
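The dual retrieval pipelines mentioned above must ultimately produce a single candidate list for the reasoning stage. As an illustration, one common way to merge ranked results from two retrievers (here, hypothetical CLIP-based and BGE-based pipelines) is reciprocal rank fusion; the abstract does not specify the fusion method, so this is a sketch of the general technique, not the authors' implementation:

```python
# Sketch: merging ranked candidate lists from two retrievers via
# reciprocal rank fusion (RRF). The retriever names and document ids
# below are illustrative assumptions, not from the paper.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document's fused score is sum(1 / (k + rank)) over the lists
    in which it appears; k dampens the dominance of top-1 hits.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-k ids from an image-embedding (CLIP-style) retriever
# and a text-embedding (BGE-style) retriever over the same corpus.
clip_hits = ["img_doc3", "img_doc1", "img_doc7"]
bge_hits = ["img_doc1", "txt_doc2", "img_doc3"]

fused = reciprocal_rank_fusion([clip_hits, bge_hits])
```

Documents appearing in both lists (`img_doc1`, `img_doc3`) rise above single-source hits, which is the usual motivation for fusing modality-specific retrievers.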
Submission Number: 16