Can LLM Judges Reliably Distinguish Good and Bad Clinical Social Skills? A Persona-Grounded Benchmark in the Indian Clinical Context

Dhruv Awasthi; Shreya Gupta; Anirudh Sharma; Tulika Saha; Dinesh Babu Jayagopi

Published: 13 Jun 2026, Last Modified: 13 Jun 2026, Venue: FSG 2026 Oral, License: CC BY 4.0

Keywords: doctor-patient communication, clinical social skills, LLM-as-judge evaluation, persona-grounded benchmark, foundation models for healthcare, Indian clinical context

TL;DR: We benchmark nine foundation models on doctor social skills using a rubric-driven LLM-as-judge over 150 Indian doctor-patient conversations. Fluent, empathetic models still fail at greeting, paraphrasing, and undesirable-persona fidelity.

Abstract: Doctor-patient communication involves more than medical correctness. Behaviours such as active listening, empathy, reassurance, and clear explanation play an important role in building patient trust and supporting clinical decision-making. However, most existing medical large language model (LLM) benchmarks focus primarily on factual reasoning and diagnostic capability, with limited evaluation of communication quality and behavioural consistency in multi-turn interactions. We introduce a persona-grounded benchmark for doctor-patient conversations in the Indian clinical context. We further develop a rubric-based evaluation framework with explicit four-level scoring and principle-adherence filtering across five dimensions: conversation initiation, responsiveness, empathy and emotional alignment, communication quality, and persona adherence. Using a synthetically generated benchmark of 150 persona-grounded doctor-patient conversations, we evaluate nine open-source and proprietary foundation models on their ability to distinguish desirable and intentionally poor clinical communication behaviours. The results show that several models assigned high behavioural scores even in conversations containing intentionally poor doctor personas, indicating difficulty in reliably separating socially aligned communication from persona-specific behaviour patterns. Overall, the benchmark highlights the importance of structured behavioural evaluation for clinical dialogue systems.

Paper Type: Long Paper

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 8
