Keywords: Multimodal; Multimodal RAG; Q&A; LLM; MLLM; VLM
TL;DR: MM-BizRAG explicitly parses document structure for multimodal RAG, outperforming vision-centric baselines on enterprise QA benchmarks, and introduces FastRAGEval for efficient, fine-grained answer evaluation.
Abstract: Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for producing retriever embeddings and for answer generation. While efficient, this trend often neglects the rich, structured information embedded in complex enterprise documents, instead depending on pre-trained embeddings or vision-language models to implicitly capture such structure. In this work, we take a more direct approach: MM-BizRAG proactively extracts and represents document structure, applying explicit layout-aware parsing for report-type documents and leveraging page-level representations for slide-type documents, guided by a document structure-aware split.
The system distinguishes between vertically structured documents (e.g., reports) and horizontally structured ones (e.g., slide decks), unifying targeted document parsing with LLM-driven artifact transformation and flexible multimodal context assembly.
Through experiments on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), we show that MM-BizRAG’s proactive parsing and artifact transformation pipeline consistently outperforms state-of-the-art vision-centric baselines, especially on report-style layouts.
Furthermore, we introduce FastRAGEval, a single-call LLM-as-judge metric for fine-grained evaluation of generative recall.
Submission Type: Deployed
Submission Number: 466