Abstract: We benchmark 90 chunker–model configurations across seven arXiv domains (2520 retrieval runs) and show that a sentence-based splitter with a 512-token window and 200-token overlap reaches the highest token-level Intersection-over-Union (IoU ≈ 0.099) while remaining compute-efficient. Our study systematically pairs seven open-source embedding models with semantic and fixed-size chunking strategies, measuring their impact on retrieval quality and latency in Retrieval-Augmented Generation (RAG) pipelines. Results reveal that (i) sentence splitting consistently outperforms alternative heuristics, (ii) smaller embeddings deliver more stable cross-domain performance than larger ones, and (iii) finance texts benefit most, whereas astrophysics lags. The accompanying code provides practitioners with empirically grounded guidelines for selecting chunking–embedding combinations that balance accuracy and efficiency.
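The best-performing configuration (sentence splitting, 512-token window, 200-token overlap) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the regex sentence splitter, and whitespace token counting are simplifying assumptions.

```python
import re

def sentence_chunks(text, window=512, overlap=200):
    """Greedily pack whole sentences into chunks of at most `window` tokens,
    carrying roughly `overlap` trailing tokens into the next chunk."""
    # Naive sentence split on terminal punctuation (an assumption; real
    # pipelines typically use a proper sentence segmenter).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # whitespace token count, a simplification
        if current and current_len + n > window:
            chunks.append(' '.join(current))
            # Re-seed the next chunk with trailing sentences totalling
            # at most `overlap` tokens, so adjacent chunks share context.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Each emitted chunk stays within the token window while preserving sentence boundaries, and consecutive chunks overlap by whole sentences rather than by a hard token cut.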
External IDs: dblp:conf/coins/StablerTMLGK25