LumiRAG: A Unified Multimodal RAG Large Model Bridging Text and Image Retrieval

Chao Wang; Shaohua Wu; Lingjun Li; Xinjing Wang; Chong Shen; Xi Chen

18 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | CC BY 4.0

Keywords: retrieval-augmented generation; reinforcement learning; large language models; multimodal models

TL;DR: LumiRAG: A unified multimodal RAG system that achieves state-of-the-art performance across text and visual retrieval tasks through progressive instruction tuning and cross-modal reinforcement learning

Abstract: Retrieval-Augmented Generation (RAG) enhances the ability of large language models to leverage external knowledge. However, existing models remain limited in their unified understanding and generation of text and multimodal retrieved content. We present LumiRAG, a suite of Qwen2.5-based models that achieve strong RAG capabilities across modalities through systematic fine-tuning on high-quality data. Our approach comprises three key components: (1) a human-synthetic hybrid dataset built with adaptive domain harvesting, dual-source generation, and multi-layer quality control, producing 520K samples across text RAG, multimodal tasks, and expert-annotated dialogues; (2) three-stage progressive instruction tuning that unifies supervised fine-tuning, context-augmented instruction tuning, and reinforcement learning with Optimized-DAPO for stepwise performance alignment; (3) a cross-modal reinforcement learning framework that employs reward shaping and stabilized training to jointly optimize retrieval accuracy and generation quality. Extensive evaluations on ChatRAG-Bench, long-form summarization benchmarks (CNN/DailyMail, XSum), MMRAG-Bench, and MMTAB demonstrate that LumiRAG substantially outperforms open-source and proprietary baselines, establishing new state-of-the-art performance across diverse modalities and task types. Model weights, datasets, and evaluation code will be open-sourced to support reproducibility and future research.

Primary Area: generative models

Submission Number: 10511
