Benchmarking LLMs for Pediatric Gastroenterology Knowledge Tasks

Published: 28 Apr 2026, Last Modified: 28 Apr 2026, Venue: MSLD 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Large language models, pediatric gastroenterology, clinical reasoning, diagnostic accuracy, medical AI, expert evaluation, R-IDEA, clinical decision support, medical language models, pediatric GI
Abstract: In recent years, large language models (LLMs) have demonstrated remarkable capabilities across a wide range of medical specialties, showing promise in clinical decision support and diagnosis. However, there is limited published research on the acceptance, clinical applications, and outcomes associated with LLMs in pediatric gastroenterology. This paper addresses that gap with a comprehensive evaluation of LLMs on real-world pediatric gastroenterology (Peds GI) scenarios. The goal is to systematically assess the diagnostic accuracy, clinical reasoning, potential harm, and factual correctness of multiple LLMs, both proprietary and open-source, in a domain that remains underexplored. We used various evaluation metrics to compare performance across general-purpose models (GPT-4o, Grok, DeepSeek, and LLaMA-4) and a domain-fine-tuned model (OpenEvidence). In doing so, we aim to gain insights into how these models might support clinical decision-making in pediatric gastroenterology and to identify areas requiring further refinement. We conducted a human expert evaluation of model outputs using the Revised-IDEA (R-IDEA) rubric and twelve qualitative axes covering dimensions such as reasoning, comprehension, scientific consensus, and potential harm. Our results show that DeepSeek-V2 achieved the highest mean R-IDEA score, with LLaMA-4, GPT-4o, Grok, and OpenEvidence ranking second through fifth, respectively. The twelve-axis evaluation further revealed that DeepSeek-V2, LLaMA-4, and GPT-4o consistently produced clinically safe, consensus-aligned, and well-reasoned responses across most pediatric gastroenterology cases, whereas OpenEvidence and Grok showed greater variability and higher rates of incorrect reasoning, omission, and bias. Overall, the reasoning, comprehension, and retrieval axes demonstrated the strongest performance, while bias, inappropriate content, and consensus alignment remained key areas for improvement.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 128