Position: CARE-RAG: Clinical Assessment and Reasoning in RAG
Keywords: WET, PTSD, RAG, Clinical Guidelines, RAG in Health Care, Clinical Assessment, Reasoning vs. Context, Context Fidelity Evaluation, Reasoning Inference Evaluation, CARE-RAG
TL;DR: CARE-RAG tests whether LLMs reason correctly with retrieved clinical evidence, using WET guidelines as a testbed. Across 20 models, accuracy is high but reasoning fidelity is inconsistent, underscoring the need for guardrails in safe, guideline-based care.
Abstract: Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols.
We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. Evaluating model responses to a curated set of clinician-vetted questions, we find that errors persist even when authoritative passages are provided.
To address this, we propose an evaluation framework that measures accuracy, consistency, and reasoning fidelity. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.
Submission Number: 159