Keywords: knowledge cutoff, clinical safety, large language models, healthcare NLP, model recency, benchmark dataset, medical question answering, model evaluation, trustworthiness
TL;DR: We built a 363-question medical multiple-choice benchmark from two dated versions of the IDSA COVID-19 guidelines and show that a model's training date, or knowledge cutoff, alone can have catastrophic effects on clinical accuracy.
Abstract: Modern clinical decisions increasingly depend on large language models (LLMs), yet these models are built on static training data that ends long before deployment. This temporal gap between training and use, commonly described as a knowledge cutoff, creates a hidden but critical failure mode: a model may be capable and aligned yet still apply outdated medical guidance with perfect fluency. To test how much data freshness alone affects clinical accuracy, this study isolates the cutoff variable across two model families with different release patterns: OpenAI's closed-weight GPT models and Meta's open-weight LLaMA series. Using two dated versions of the Infectious Diseases Society of America (IDSA) COVID-19 Treatment and Management Guidelines (v5.0.0, August 25, 2021; v11.0.0, June 26, 2023), we extracted recommendation-level differences and automatically generated 363 multiple-choice questions capturing genuine shifts in therapeutic advice. Each model answered the same items under identical prompts and deterministic decoding settings. Accuracy rose sharply only when a model's presumed training window included the newer guideline: GPT-3.5-Turbo and LLaMA-2-13B, whose cutoffs predate June 2023, lagged significantly behind models whose knowledge cutoffs postdate v11.0.0, while GPT-4o, GPT-5, and LLaMA-3.3-70B, trained on fresher data, converged above 90% accuracy. The consistency of this pattern across closed and open systems indicates that temporal coverage, not mere parameter count, drives gains in applied medical reasoning. These findings argue that model recency must be treated as a safety-critical attribute on par with alignment or interpretability.
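The evaluation protocol described in the abstract (identical prompts, deterministic decoding, single-letter MCQ answers) can be sketched as below. This is a minimal illustration, not the authors' harness: the prompt wording, option labels, the `ask_mcq` helper, and the sample item are all assumptions, and `temperature=0` stands in for whatever deterministic settings the paper actually used.

```python
# Minimal sketch of a deterministic MCQ evaluation loop, assuming the
# OpenAI Python SDK (>= 1.0). Prompt template and item format are
# illustrative, not the paper's exact harness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_mcq(model: str, stem: str, options: dict[str, str]) -> str:
    """Pose one multiple-choice item under a fixed prompt and greedy decoding."""
    option_block = "\n".join(f"{label}. {text}" for label, text in options.items())
    prompt = (
        "Answer the following clinical multiple-choice question with the "
        "single letter of the best option.\n\n"
        f"{stem}\n{option_block}\nAnswer:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic settings, per the abstract
        max_tokens=1,   # force a single-letter answer
    )
    return response.choices[0].message.content.strip().upper()


# Hypothetical item derived from a recommendation-level difference between
# guideline versions; the real benchmark contains 363 such questions.
item = {
    "stem": "Per the current IDSA COVID-19 guideline, which therapy is recommended?",
    "options": {"A": "Option A", "B": "Option B", "C": "Option C", "D": "Option D"},
    "answer": "B",
}
prediction = ask_mcq("gpt-4o", item["stem"], item["options"])
print("correct" if prediction == item["answer"] else "incorrect")
```

Running the same items verbatim against each model, with decoding fixed, is what lets the benchmark attribute accuracy differences to knowledge cutoff rather than to prompt or sampling variation.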
Submission Number: 26