Do They Really Know? Evaluating Large Language Models’ Ability to Reference and Cite Oncology Guidelines

Pietro Belligoli, Timothy A. Miller, Danielle Bitterman

Published: 24 Jun 2025, Last Modified: 03 May 2026. Venue: Artificial Intelligence in Medicine: 23rd International Conference, AIME 2025. License: CC BY 4.0

Abstract: Large language models (LLMs) hold significant promise for clinical decision support because they can generate evidence-based recommendations, particularly in complex domains such as breast cancer. This study investigates whether LLMs possess specific knowledge of restricted-access oncology guidelines (NCCN) and open-access guidelines (ASCO and ESMO) by evaluating their performance on 50 synthetic breast cancer case vignettes. Two proprietary models (GPT-4 and Claude-3.5-Sonnet) and two open-source models (LLaMA-3.2 3B and Mistral-7B) were prompted to generate treatment recommendations and to retrieve the exact guideline citations on which they based those recommendations. References were manually evaluated and classified as exact matches, paraphrased, or hallucinated. Although none of the models successfully retrieved verbatim quotes, GPT-4 generated citations that reflected the content of the NCCN, ASCO, and ESMO guidelines in 90%, 64%, and 70% of cases, respectively. Claude-3.5-Sonnet performed similarly, with 80% for NCCN, 84% for ASCO, and 88% for ESMO. In contrast, LLaMA-3.2 3B showed weaker performance, referencing NCCN, ASCO, and ESMO in 26%, 28%, and 50% of cases, respectively. Mistral-7B performed comparably to LLaMA-3.2 3B for NCCN (14%) but achieved higher rates for ASCO (68%) and ESMO (84%). As LLMs evolve, ensuring consistent output, accurate citations, and reliable referencing of clinical guidelines will be essential for their integration into clinical decision support systems.
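
The per-guideline percentages above come from manually assigned labels. As a minimal sketch of how such labels could be tallied into rates, assuming one label per vignette for a given model and guideline: the function name `label_rates`, the label strings, and the example data are all hypothetical illustrations, not the authors' code or the study's data.

```python
from collections import Counter

# Label names follow the three categories named in the abstract;
# the exact strings are an assumption made for this sketch.
LABELS = ("exact", "paraphrased", "hallucinated")


def label_rates(labels):
    """Return the percentage of cases assigned to each label."""
    if not labels:
        raise ValueError("no labels to tally")
    counts = Counter(labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    # Counter returns 0 for labels that never occur.
    return {name: 100.0 * counts[name] / total for name in LABELS}


# Made-up labels for 50 vignettes (NOT the study's data).
example = ["paraphrased"] * 30 + ["hallucinated"] * 20
rates = label_rates(example)
print(rates)
```

In this sketch, the share of "paraphrased" labels would correspond to citations that reflect guideline content without quoting it verbatim.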