Keywords: Generative AI, Medical education, Cognitive Behavior, Realism Evaluation, Virtual Patients (VP)
Abstract: Large language models (LLMs) are increasingly used to simulate complex social and cognitive tasks, yet the behavioral regularities and heuristics they employ remain underexplored. In this study, we investigate GPT-4o's behavioral patterns when performing the cognitively demanding task of role-playing virtual patients (VPs) in clinical interviews. VPs offer a promising alternative to traditional training tools, but their conversational realism has received little systematic evaluation. Using 44 structured illness-script prompts spanning 17 clinical categories, we analyze the model's output through expert review, Turing-style discrimination testing, linguistic profiling, and semantic similarity analysis. Expert annotations of hallucinations, omissions, and repetitions showed high interrater reliability ($ICC > 0.77$). In a Turing-style test, participants struggled to distinguish VPs from real patients; classification accuracy fell below chance. Linguistic analysis of more than 2,000 dialogue turns revealed that VPs produced formal, lexically consistent responses, while human patients showed greater emotional and stylistic variability. BioClinicalBERT-based semantic similarity scores averaged 0.871 at the response level and 0.842 at the transcript level, indicating strong alignment. This behavioral characterization contributes to understanding how LLMs generalize to cognitively complex, open-ended interaction tasks and provides a reproducible evaluation framework for studying model behavior in social and domain-specific contexts.
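The abstract reports BioClinicalBERT-based semantic similarity scores but does not specify the metric; a common choice with BERT-style embeddings is cosine similarity over pooled token representations. The sketch below shows that computation on plain Python vectors, where `u` and `v` are hypothetical placeholders for embeddings that, in the study's pipeline, would come from BioClinicalBERT.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical placeholder vectors; in practice these would be pooled
# BioClinicalBERT embeddings of a VP response and a reference response.
u = [0.20, 0.70, 0.10]
v = [0.25, 0.65, 0.15]
score = cosine_similarity(u, v)  # a value in [-1, 1]; higher = more similar
```

Response-level scores like the reported 0.871 average would then be obtained by averaging such pairwise similarities across dialogue turns.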
Submission Number: 2