Abstract: Early disease diagnosis can dramatically improve patient outcomes by enabling timely interventions, yet traditional approaches rely on laboratory and imaging data that require clinical visits and incur significant costs and delays. In this study, we introduce MIMIC-SR-ICD11 (MIMIC Self-Report with ICD-11), a dataset that transforms EHR discharge notes from the MIMIC database into first-person patient narratives and standardizes every diagnosis using WHO ICD-11 codes. We benchmark three leading large language models on overall accuracy (Hit@1 and F1 variants), sensitivity to candidate-list length and ordering, and robustness across diseases of varying prevalence. Our experiments show that simply shortening the candidate list does not yield proportional gains in accuracy, and F1 scores can even fall below a random-guess baseline. By splitting diseases into ten frequency-based groups, we uncover an unexpected accuracy dip for the most common conditions. To explain this phenomenon, we introduce two lexical specificity metrics: disease frequency–medical vocabulary size (DF-MVS) and medical term exclusivity score (MTES). These metrics demonstrate that generic, non-distinctive terminology drives prediction bias. To support future advances, we release our dataset as a standardized benchmark for the development of specialized medical diagnostic models.
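To make the evaluation protocol concrete, here is a minimal Python sketch of the Hit@1 metric and the ten-way frequency grouping mentioned in the abstract, plus one plausible reading of MTES. The abstract does not define the two specificity metrics, so the `mtes` function is purely illustrative; all function names and the data layout are assumptions, not the authors' released benchmark code.

```python
from collections import Counter

def hit_at_1(predictions, gold):
    """Fraction of cases where the top-ranked ICD-11 code matches the gold code."""
    assert len(predictions) == len(gold)
    hits = sum(1 for ranked, g in zip(predictions, gold) if ranked and ranked[0] == g)
    return hits / len(gold)

def frequency_groups(gold, n_groups=10):
    """Split disease codes into n_groups ordered by descending corpus frequency."""
    ordered = [code for code, _ in Counter(gold).most_common()]
    size = -(-len(ordered) // n_groups)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def mtes(disease_terms, disease):
    """One plausible reading of the medical term exclusivity score (undefined in
    the abstract): the fraction of a disease's medical terms that occur for no
    other disease in the corpus."""
    own = disease_terms[disease]
    others = set().union(*(t for d, t in disease_terms.items() if d != disease))
    return len(own - others) / len(own) if own else 0.0

# Toy usage: ranked candidate lists vs. gold ICD-11 codes (codes are illustrative).
preds = [["CA40.0", "BA00"], ["BA00", "CA40.0"]]
gold = ["CA40.0", "CA40.0"]
print(hit_at_1(preds, gold))                      # 0.5
print(frequency_groups(gold + ["BA00"]))          # [['CA40.0'], ['BA00']]
print(mtes({"CA40.0": {"cough", "fever"}, "BA00": {"fever"}}, "CA40.0"))  # 0.5
```

Under this reading, a low MTES for a high-frequency disease would indicate that its vocabulary is shared with many other conditions, consistent with the accuracy dip the abstract reports for common diseases.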
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Human-Centered NLP; Model Bias and Fairness; Data-Efficient Training; Interpretability and Analysis
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4759