Team Metamorphosis: Meta CRAGMM KDD CUP

Published: 20 Aug 2025, Last Modified: 01 Feb 2026, 2025 KDD Cup CRAG-MM Workshop, CC BY-NC 4.0
Keywords: VLLMs, MM-RAG QA system
Abstract: Vision-Language Models (VLMs) show remarkable capabilities in multimodal reasoning but struggle with hallucinations, long-tail recognition, and grounding under real-world conditions such as those found in wearable devices. The Meta CRAG-MM Challenge 2025 presents a benchmark to evaluate these issues across three tasks: Single-source Augmentation, Multi-source Augmentation, and Multi-turn QA. We present a modular MM-RAG pipeline that explicitly targets factuality, interpretability, and robustness in such scenarios. Our system combines object-aware image cropping, domain-specific identification via a LoRA-finetuned LLaMA 3.2 Vision-Instruct model [4], CLIP- and BGE-based dual retrieval pipelines, and structured Chain-of-Thought (CoT) reasoning followed by a hallucination-sensitive summarizer. We observe that visual preprocessing and identification significantly improve retrieval quality, while CoT prompting enhances answer consistency. Early results show reduced hallucination and improved truthfulness over a baseline LLM setup. We outline remaining challenges and future directions, including contrastive answer alignment, domain generalization, and inference optimization, toward developing real-time, grounded VLMs for egocentric vision tasks.
Submission Number: 16