Abstract: We benchmark 90 chunker–model configurations across seven arXiv domains (2520 retrieval runs) and show that a sentence-based splitter with a 512-token window and 200-token overlap reaches the highest token-level Intersection-over-Union (IoU ≈ 0.099) while remaining compute-efficient. Our study systematically pairs seven open-source embedding models with semantic and fixed-size chunking strategies, measuring their impact on retrieval quality and latency in Retrieval-Augmented Generation (RAG) pipelines. Results reveal that (i) sentence splitting consistently outperforms alternative heuristics, (ii) smaller embeddings deliver more stable cross-domain performance than larger ones, and (iii) finance texts benefit most, whereas astrophysics lags. The accompanying code provides practitioners with empirically grounded guidelines for selecting chunking–embedding combinations that balance accuracy and efficiency.
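The best-performing configuration (sentence splitting, 512-token window, 200-token overlap) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the regex sentence splitter, and whitespace token counting are simplifying assumptions.

```python
import re

def sentence_chunks(text, window=512, overlap=200):
    """Greedily pack whole sentences into chunks of at most `window` tokens,
    carrying roughly `overlap` trailing tokens into the next chunk."""
    # Naive sentence split on terminal punctuation (an assumption; real
    # pipelines typically use a proper sentence segmenter).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # whitespace token count, a simplification
        if current and current_len + n > window:
            chunks.append(' '.join(current))
            # Re-seed the next chunk with trailing sentences totalling
            # at most `overlap` tokens, so adjacent chunks share context.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Each emitted chunk stays within the token window while preserving sentence boundaries, and consecutive chunks overlap by whole sentences rather than by a hard token cut.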
External IDs: dblp:conf/coins/StablerTMLGK25