Keywords: retrieval-augmented generation; reinforcement learning; large language models; multimodal models
TL;DR: LumiRAG: A unified multimodal RAG system that achieves state-of-the-art performance across text and visual retrieval tasks through progressive instruction tuning and cross-modal reinforcement learning
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models' ability to leverage external knowledge. However, existing models remain limited in their unified understanding and generation of text and multimodal retrieved content. We present LumiRAG, a suite of Qwen2.5-based models that achieve strong RAG capabilities across modalities through systematic fine-tuning on high-quality data. Our approach comprises three key components: (1) a human-synthetic hybrid dataset built with adaptive domain harvesting, dual-source generation, and multi-layer quality control, yielding 520K samples across text RAG, multimodal tasks, and expert-annotated dialogues; (2) three-stage progressive instruction tuning that combines supervised fine-tuning, context-augmented instruction tuning, and reinforcement learning with Optimized-DAPO for stepwise performance alignment; and (3) a cross-modal reinforcement learning framework that employs reward shaping and stabilized training to jointly optimize retrieval accuracy and generation quality. Extensive evaluations on ChatRAG-Bench, long-form summarization benchmarks (CNN/DailyMail, XSum), MMRAG-Bench, and MMTAB demonstrate that LumiRAG substantially outperforms open-source and proprietary baselines, establishing new state-of-the-art performance across diverse modalities and task types. Model weights, datasets, and evaluation code will be open-sourced to support reproducibility and future research.
Primary Area: generative models
Submission Number: 10511