Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Medical Large Language Models (LLMs), Patient-Centered Applications, Safety Evaluation Framework, Benchmark Dataset
Abstract: Large Language Models (LLMs) in the medical domain have been developed and validated primarily for healthcare professionals, leaving a significant gap in patient-centered adaptation. Because real-world patient use of these models poses safety risks, rigorous evaluation tailored to patient interaction scenarios is essential. To address this, we introduce \textbf{PatientSafeBench}, a novel benchmark that assesses both the safety and utility of LLMs in patient-facing contexts. It comprises five categories and 25 subcategories, each capturing a critical aspect of LLM performance in patient use. We developed 500 evaluation queries grounded in real clinical cases, with scoring criteria reviewed by four medical professionals. We evaluated 11 LLMs on PatientSafeBench using a multi-judge approach, scoring responses on a 10-point scale with hierarchical safety thresholds. The results reveal that no model met our safety criteria for patient use, and medical-specific LLMs surprisingly underperformed general-purpose models. All models showed consistent weaknesses in temporal relevance, transparency, personalization, and user engagement. These findings highlight the need for dedicated patient-centered benchmarks to ensure the safety and effectiveness of LLMs in patient-facing applications.
Track: 4. Clinical Informatics
Registration Id: 00000000000
Submission Number: 115