Abstract: Retrieval-augmented generation (RAG) systems aim to improve the reliability of answers by incorporating information from external sources. The value of RAG depends on how well the knowledge base meets users' information needs. However, most existing evaluation methods for RAG pipelines focus on the quality of the generated answers or the precision of the retriever, without assessing whether the knowledge base itself contains the needed information. RAG benchmarks are typically created by generating questions directly from the documents in the knowledge base, which may not reflect the diversity of real user questions. We introduce GapView, a framework for evaluating whether the knowledge base in a RAG pipeline provides sufficient coverage to support expected user questions. GapView uses cosine similarity between embeddings and two-dimensional multidimensional scaling (MDS) projections to check whether a question is semantically aligned with any document in the corpus. We evaluate GapView on six synthetic datasets from clinical and programming domains. Results show that GapView achieves high F1 scores ($\geq 0.93$) in predicting coverage and reveals domain-specific performance differences. Unlike traditional RAG metrics, GapView identifies knowledge gaps and provides clear visualizations that reveal where information is missing. Our findings highlight the importance of validating knowledge base coverage in RAG pipelines and offer a scalable method for flagging unsupported questions before they enter the pipeline.
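To make the abstract's coverage check concrete, the sketch below pairs a cosine-similarity test between a question embedding and all document embeddings with an MDS projection of the joint embedding space. This is a minimal illustration under stated assumptions, not the paper's implementation: the `all-MiniLM-L6-v2` encoder, the `COVERAGE_THRESHOLD` value, and the helper names are placeholders chosen for the example.

```python
# Minimal sketch of an embedding-based coverage check, assuming the
# sentence-transformers and scikit-learn libraries are available. The
# encoder name and similarity threshold are illustrative, not the
# settings used in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS

COVERAGE_THRESHOLD = 0.5  # hypothetical cutoff; a real system would tune this

def coverage_check(question, documents, model_name="all-MiniLM-L6-v2"):
    """Return (is_covered, per-document similarities) for one question."""
    model = SentenceTransformer(model_name)
    q_emb = model.encode([question])            # shape (1, d)
    d_embs = model.encode(documents)            # shape (n_docs, d)
    sims = cosine_similarity(q_emb, d_embs)[0]  # similarity to each document
    return bool(sims.max() >= COVERAGE_THRESHOLD), sims

def project_2d(question, documents, model_name="all-MiniLM-L6-v2"):
    """Project the question and documents to 2D with MDS for inspection."""
    model = SentenceTransformer(model_name)
    embs = model.encode([question] + documents)
    # Use cosine distances so the projection reflects the same geometry
    # as the coverage check above.
    dists = np.clip(1.0 - cosine_similarity(embs), 0.0, None)
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    return mds.fit_transform(dists)  # row 0 is the question, rest are documents
```

In this sketch, a question whose nearest document falls below the threshold would be flagged as unsupported before retrieval, and its outlying position in the 2D projection makes the gap visible.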
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: retrieval-augmented generation, evaluation methodologies, automatic evaluation of datasets, benchmarking, question answering, interpretability
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: N/A
B1 Elaboration: We created all artifacts used in the paper; see Section 3 for details. Artifacts will be released upon acceptance of the manuscript.
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: N/A
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: See Section 4 for details on the experimental setup
C3 Descriptive Statistics: Yes
C3 Elaboration: See Section 5 for descriptive statistics of the results
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: The instructions given to annotators are described in Section 4.2
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Generative AI was used for grammar assistance; the authors take full responsibility for the content.
Author Submission Checklist: Yes
Submission Number: 1014