Abstract: Retrieval-Augmented Generation (RAG) systems have become central to information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches while maintaining semantic coherence and structural integrity. Our method processes documents in configurable page batches with cross-batch context preservation, enabling accurate handling of tables that span multiple pages, embedded visual elements, and procedural content. On an internal benchmark of diverse PDF documents, our vision-guided approach outperforms a vanilla text-based RAG baseline quantitatively, and qualitative analysis shows better preservation of document structure and semantic coherence.
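For concreteness, the following is a minimal sketch of the batched chunking loop described above. The batch size, the `lmm_chunk_fn` stand-in, and the string representation of carried context are illustrative assumptions, not the paper's implementation (which is detailed in Sections 4.2–4.3).

```python
# A minimal sketch (not the paper's code) of batched multimodal chunking
# with cross-batch context preservation, as described in the abstract.
def chunk_document(page_images, lmm_chunk_fn, batch_size=4):
    """Chunk a list of PDF page images in fixed-size batches.

    `lmm_chunk_fn` stands in for a call to a Large Multimodal Model that
    takes a batch of page images plus carried-over context (e.g. the header
    of a table that continues into the next batch) and returns the chunks
    for this batch together with the context to carry forward.
    """
    chunks = []
    carry = ""  # context preserved across batch boundaries
    for start in range(0, len(page_images), batch_size):
        batch = page_images[start:start + batch_size]
        batch_chunks, carry = lmm_chunk_fn(batch, context=carry)
        chunks.extend(batch_chunks)
    return chunks
```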
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Information Extraction, Information Retrieval and Text Mining, Multimodality and Language Grounding to Vision, Robotics and Beyond, NLP Applications, Summarization, Syntax: Tagging, Chunking and Parsing
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.1
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: Our approach runs computational experiments through API calls to external providers (Gemini-2.5-Pro, GPT-4.1, GPT-4.1-mini, and OpenAI text-embedding-3-small), so we do not report model parameter counts, GPU hours, or detailed infrastructure specifications; these are proprietary to the API providers (Google and OpenAI). Our experimental methodology focuses on the document processing pipeline and RAG system performance evaluation rather than computational resource analysis. The models used are documented in Section 4.2, and our evaluation methodology is described in Section 5.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Experimental setup and hyperparameters are discussed in Section 4. Section 4.2 reports the temperature setting (T = 0.1) used for consistent chunk generation with Gemini-2.5-Pro, and Section 4.3 covers the prompt engineering methodology, including the iterative refinement process for chunk generation. The complete prompt design is provided in Appendix A.1. Traditional hyperparameter search was not applicable given our focus on document processing pipeline design, but we detail the key configuration parameters that affect system performance.
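A hypothetical sketch of the chunk-generation call with T = 0.1 follows; only the temperature value and the model choice come from the paper (Section 4.2), while the prompt wiring, API-key handling, and model-id string are assumptions.

```python
# Illustrative call to Gemini-2.5-Pro with temperature 0.1 (Section 4.2),
# using the google-generativeai client. Prompt content is a placeholder.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key supplied directly
model = genai.GenerativeModel("gemini-2.5-pro")

def generate_chunks(page_images, chunking_prompt):
    # Low temperature keeps chunk boundaries consistent across runs.
    response = model.generate_content(
        [chunking_prompt, *page_images],
        generation_config=genai.GenerationConfig(temperature=0.1),
    )
    return response.text
```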
C3 Descriptive Statistics: No
C3 Elaboration: No; we do not report error bars, confidence intervals, or summary statistics from multiple experimental runs. Our evaluation presents single-run results comparing Vision-Guided RAG (0.89 accuracy) with Vanilla RAG (0.78 accuracy), as reported in Table 1, Section 6.2. The focus of our work is on demonstrating methodological improvements in document chunking quality and system architecture rather than statistical significance testing. Our evaluation using GPT-4.1-mini as an automated judge (Section 5.3) provides a qualitative assessment of chunk quality, but we acknowledge that multiple runs with statistical analysis would strengthen the quantitative claims.
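The automated-judge evaluation could look like the sketch below; only the judge model (GPT-4.1-mini, Section 5.3) comes from the paper, while the judge prompt and binary scoring scheme are assumptions.

```python
# Hypothetical LLM-as-judge call for answer correctness (cf. Section 5.3).
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, reference: str, answer: str) -> bool:
    """Ask GPT-4.1-mini whether a system answer matches the reference."""
    prompt = (
        "Given the question, reference answer, and system answer, "
        "reply with exactly CORRECT or INCORRECT.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```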
C4 Parameters For Packages: Yes
C4 Elaboration: Implementation details and parameter settings for external packages are reported in Section 5.1: OpenAI text-embedding-3-small for document embedding, Elasticsearch as the vector database, and top-k similarity search with k = 10 for retrieval. Section 4.2 reports the specific models used (Gemini-2.5-Pro, GPT-4.1, GPT-4.1-mini) and their configuration, including temperature (T = 0.1). Some packages are used as standard implementations without customized parameters, but we provide sufficient detail to reproduce our experimental setup.
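A minimal sketch of this retrieval setup is shown below, assuming an Elasticsearch 8.x dense_vector index; the index name, field names, and num_candidates value are illustrative, while the embedding model and k = 10 come from Section 5.1.

```python
# Sketch of the retrieval path: embed the query with OpenAI
# text-embedding-3-small, then kNN search in Elasticsearch with k = 10.
from openai import OpenAI
from elasticsearch import Elasticsearch

openai_client = OpenAI()
es = Elasticsearch("http://localhost:9200")  # assumption: local cluster

def retrieve(query: str, index: str = "chunks", k: int = 10):
    """Return the top-k chunks by vector similarity to the query."""
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    resp = es.search(
        index=index,
        knn={
            "field": "embedding",   # dense_vector field holding chunk embeddings
            "query_vector": emb,
            "k": k,
            "num_candidates": 100,  # assumption: candidate pool size
        },
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```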
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 44