Keywords: Visual Question Answering, Retrieval-Augmented Generation, Multimodal LLM, Vision Language Models, Fine-tuning, LLaMA, Hallucination Detection, Multitask Learning, NVIDIA RAGAS, KDD Cup 2025
TL;DR: Team NVIDIA’s winning solution for KDD Cup 2025 Task 2 leveraged multitask fine-tuning of Llama-3.2 Vision, optimized with the NVIDIA RAGAS Accuracy metric for robust multimodal RAG performance.
Abstract: The KDD Cup 2025 Task 2 focused on building multimodal RAG systems for visual question answering while minimizing hallucinations. Team NVIDIA's winning solution leveraged a fine-tuned Llama-3.2-11B-Vision-Instruct VLM to perform three critical subtasks: generating web search queries, re-ranking contexts, and producing grounded answers. The VLM was trained to output "I don't know" when retrieved information was insufficient.
A fine-tuning datamix comprising 26.5k samples was carefully curated from 2.5k competition examples, using NVIDIA NIM llama-4-maverick for synthetic data generation and GPT-4o as an LLM-as-a-judge. The datamix curation, VLM fine-tuning process, and hyperparameter selection were all optimized using the NVIDIA RAGAS Accuracy metric — a blend of three LLM judges (a "council of judges", including Nemotron) that achieves 0.92+ correlation with human judgment. Since the competition's final evaluation was determined by human judges, tuning our pipeline against RAGAS offered a strong advantage.
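The "council of judges" idea can be illustrated with a minimal sketch. The abstract does not specify how the three judges' scores are combined, so the simple averaging below (and the function and variable names) are assumptions for illustration only:

```python
from statistics import mean

def council_accuracy(judge_scores: list[float]) -> float:
    """Blend per-judge accuracy scores (each in [0, 1]) into a single
    metric. Simple averaging is an assumed blending scheme; the actual
    RAGAS blend may weight judges differently."""
    if not judge_scores:
        raise ValueError("need at least one judge score")
    return mean(judge_scores)

# Hypothetical scores from three judges for one response:
score = council_accuracy([1.0, 1.0, 0.0])
```

A blended score like this is then compared against human labels across a validation set to verify the reported 0.92+ correlation.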
Pipeline responses were post-processed with an "I don't know" probability threshold optimized using the NVIDIA RAGAS Accuracy metric. Our approach achieved a final human evaluation score of 0.233, securing first place by effectively balancing answer coverage with hallucination prevention.
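The thresholded post-processing step can be sketched as follows. How the pipeline actually estimates the "I don't know" probability (e.g. from token log-probs) is not stated in the abstract, so `p_idk`, the function name, and the default threshold are illustrative assumptions:

```python
def apply_idk_threshold(answer: str, p_idk: float, threshold: float = 0.5) -> str:
    """Post-process a generated answer: if the model's estimated
    probability of an "I don't know" response (p_idk) meets the tuned
    threshold, abstain; otherwise keep the answer. The threshold itself
    would be selected by maximizing the RAGAS Accuracy metric on a
    validation set."""
    return "I don't know" if p_idk >= threshold else answer
```

Abstaining when `p_idk` is high trades answer coverage for fewer hallucinations, which is exactly the balance the human evaluation rewarded.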
Submission Number: 18