Benchmarking LLMs for Pediatric Gastroenterology Knowledge Tasks
Keywords: Large language models, pediatric gastroenterology, clinical reasoning, diagnostic accuracy, medical AI, expert evaluation, R-IDEA, clinical decision support, medical language models, pediatric GI
Abstract: In recent years, large language models (LLMs) have demonstrated remarkable
capabilities across a wide range of medical specialties, showing promise in clinical
decision support and diagnosis. However, there is limited published research
on the acceptance, clinical applications, and outcomes associated with LLMs in
pediatric gastroenterology. This paper addresses this gap by describing a comprehensive
evaluation of LLMs on real-world pediatric gastroenterology (Peds GI) scenarios.
The goal is to systematically assess the diagnostic accuracy, clinical reasoning,
potential harm, and factual correctness of multiple LLMs, including proprietary and
open-source models, in a domain that remains underexplored. We used various evaluation
metrics to compare performance across general-purpose models (GPT-4o, Grok, DeepSeek,
and LLaMA-4) and domain-fine-tuned models (OpenEvidence). In doing so, we aim
to gain insights into how these models might support clinical decision-making in
pediatric gastroenterology and identify areas requiring further refinement. We
conducted a human expert evaluation of model outputs using the Revised-IDEA
(R-IDEA) rubric and twelve qualitative axes assessing reasoning, comprehension,
scientific consensus, and potential harm. Our results show that DeepSeek-V2
achieved the highest mean R-IDEA score, with LLaMA-4, GPT-4o, Grok,
and OpenEvidence ranking second through fifth, respectively. The twelve-axis
evaluation further revealed that DeepSeek-V2, LLaMA-4, and GPT-4o consistently
produced clinically safe, consensus-aligned, and well-reasoned responses
across most pediatric gastroenterology cases, whereas OpenEvidence and Grok
showed greater variability and higher rates of incorrect reasoning, omission, and
bias. Overall, the reasoning, comprehension, and retrieval axes demonstrated
the strongest performance, while bias, inappropriate content, and consensus
alignment remained key areas for improvement.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 128