Can You Spot the Virtual Patient (VP)? Expert Evaluation, Turing Test, Linguistic Analysis, and Semantic Similarity Analysis
Keywords: Generative AI, medical education, virtual patient, realism evaluation
Abstract: Communication is a critical clinical skill, yet scalable, realistic training tools remain limited. Large language model (LLM)-based virtual patients (VPs) offer a promising alternative to traditional tools, but their conversational realism remains underexplored. In this study, we evaluate the realism of GPT-4o-generated VPs using a multi-method approach: expert review, Turing-style testing, linguistic analysis, and semantic similarity analysis. We generated 44 VPs based on real doctor–patient dialogues. Expert annotations of hallucinations, omissions, and repetitions showed high interrater reliability ($ICC > 0.77$). In a Turing test, participants struggled to distinguish VPs from real patients: classification accuracy fell below chance. Linguistic analysis of 2,000+ dialogue turns revealed that VPs produced formal, lexically consistent responses, while human patients showed more emotional and stylistic variability. Semantic similarity scores averaged 0.871 (response-level) and 0.842 (transcript-level), indicating strong alignment. These findings support the use of LLM-based VPs in communication training and offer insights into realism, trust, and directions for refinement, contributing to the safe and responsible deployment of generative AI in healthcare.
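The abstract does not state how the response-level and transcript-level semantic similarity scores were computed. A minimal sketch of one common approach is shown below, assuming sentence-embedding cosine similarity; the embedding model (`all-MiniLM-L6-v2` from the sentence-transformers library) and the turn-alignment scheme are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch: response-level and transcript-level semantic similarity
# between real patient utterances and VP-generated utterances.
# Assumptions (not from the paper): sentence-transformers embeddings with
# the all-MiniLM-L6-v2 model, and a one-to-one alignment of dialogue turns.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def response_similarity(real_turns, vp_turns):
    """Mean cosine similarity over aligned dialogue turns."""
    real_emb = model.encode(real_turns, convert_to_tensor=True)
    vp_emb = model.encode(vp_turns, convert_to_tensor=True)
    # Diagonal picks each real turn's similarity with its VP counterpart.
    sims = util.cos_sim(real_emb, vp_emb).diagonal()
    return sims.mean().item()

def transcript_similarity(real_transcript, vp_transcript):
    """Cosine similarity of whole-transcript embeddings."""
    emb = model.encode([real_transcript, vp_transcript], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Toy example with two aligned turns:
real = ["The pain started two days ago.", "It gets worse when I walk."]
vp = ["My pain began about two days ago.", "Walking makes it worse."]
print(f"response-level: {response_similarity(real, vp):.3f}")
print(f"transcript-level: {transcript_similarity(' '.join(real), ' '.join(vp)):.3f}")
```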
Submission Number: 5